[ https://issues.apache.org/jira/browse/SQOOP-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363273#comment-14363273 ]

Jarek Jarcec Cecho commented on SQOOP-1803:
-------------------------------------------

Thank you for putting it together [~vybs].

Indeed the current {{MutableContext}} serializes all the data as Strings, but 
that is just an internal detail modeled on what Hadoop's {{Configuration}} has 
been doing. We're still exposing {{setBoolean}}, {{setInt}}, ... methods and 
their {{getType}} alternatives, so a connector developer can store any type in 
the {{Context}}. It is, however, their responsibility to remember what type has 
been stored there (e.g. we do not persist the information that property "X" has 
been saved as a long). The {{MutableContext}} is not persisted in our repository 
and is meant more as a transient store specific to a given submission; I believe 
that the context is fully lost after the submission ends.
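For illustration, a minimal sketch of that implicit typing contract (the method 
and package names are my reading of the current API, so treat them as 
assumptions; the helper class and property key are hypothetical):

{code:java}
import org.apache.sqoop.common.ImmutableContext;
import org.apache.sqoop.common.MutableContext;

// Hypothetical helper showing that the connector, not Sqoop, carries the
// knowledge of what type a given property was stored as.
public class LastValueHelper {
  private static final String KEY = "myconnector.last.imported.id";

  public void save(MutableContext context, long lastImportedId) {
    // Stored internally as a String; the "long" type is not recorded anywhere.
    context.setLong(KEY, lastImportedId);
  }

  public long restore(ImmutableContext context) {
    // The connector must remember on its own that KEY was saved as a long.
    return context.getLong(KEY, 0L);
  }
}
{code}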
Hence I think that we should have a contract somewhere in the connector API 
through which the connector, given the context object, can update the 
appropriate configuration. A couple of ideas:

1) We currently call the 
{{[Initializer.initialize()|https://github.com/apache/sqoop/blob/sqoop2/connector/connector-sdk/src/main/java/org/apache/sqoop/job/etl/Initializer.java#L47]}}
on every job initialization (in both the From and To contexts). We could allow 
the connector to change the given configuration objects. If and only if the job 
is successful, we would persist the updated configuration objects in the 
repository via the normal update path (the same one that is used by the user). 
As the job submission is asynchronous, we might need to come up with a mechanism 
to persist the updated configuration objects with the Hadoop job itself and get 
them back later.

*Pros:* Seems relatively simple to implement as we are already preserving a lot 
of information with the Hadoop job itself.
*Cons:* We would introduce a kind of "implied" or "secret" API, as connector 
developers have to know that they are allowed to change the configuration 
objects (see the sketch below).
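To make the "implied" part concrete, a sketch of what a connector would do under 
this idea ({{MyLinkConfiguration}}, {{MyJobConfiguration}}, and its 
{{incremental}} field are hypothetical):

{code:java}
import org.apache.sqoop.job.etl.Initializer;
import org.apache.sqoop.job.etl.InitializerContext;

// Sketch of idea 1: the connector simply mutates the configuration objects it
// is handed; Sqoop would persist them only if the job succeeds.
public class MyInitializer extends Initializer<MyLinkConfiguration, MyJobConfiguration> {
  @Override
  public void initialize(InitializerContext context,
                         MyLinkConfiguration linkConfig,
                         MyJobConfiguration jobConfig) {
    // Nothing in the API signals that this mutation will be persisted -
    // that is the "secret" contract the developer has to know about.
    jobConfig.incremental.lastImportedValue = currentBoundary();
  }

  private String currentBoundary() {
    return String.valueOf(System.currentTimeMillis()); // placeholder logic
  }
}
{code}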

2) Alternatively, we could expose an explicit API 
{{updateConfigurationObjects(Context, LinkConfiguration, JobConfiguration)}} 
(proper name pending) that connector developers could implement if they care 
about updating the configuration objects. As this API would make sense only 
after the job has successfully finished, we could:

*Pros:* We have an explicit API with nicely defined semantics. We don't need to 
persist any additional information in the Hadoop job object.
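As a sketch, the hook could be a standalone contract that the server invokes 
after a successful job (the interface name and the use of {{ImmutableContext}} 
are my assumptions; the method name is pending, as noted above):

{code:java}
import org.apache.sqoop.common.ImmutableContext;

// Sketch of idea 2: an explicit, opt-in callback that the Sqoop 2 server
// would invoke only after the job has finished successfully. The connector
// copies values from the transient context into the configuration objects,
// which the server then persists via the normal update path.
public interface ConfigurationUpdater<LinkConfiguration, JobConfiguration> {
  void updateConfigurationObjects(ImmutableContext context,
                                  LinkConfiguration linkConfiguration,
                                  JobConfiguration jobConfiguration);
}
{code}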

2.1) Introduce it as part of 
[Destroyer|https://github.com/apache/sqoop/blob/sqoop2/connector/connector-sdk/src/main/java/org/apache/sqoop/job/etl/Destroyer.java]

*Pros:* Updating the configuration objects is part of the clean-up phase, so it 
makes sense to have it as part of the {{Destroyer}}.
*Cons:* Currently the {{Destroyer}} runs outside of the Sqoop 2 server, 
somewhere on the cluster. We would either have to move the {{Destroyer}} to be 
executed in the server, or simply call this particular method on a different 
instance of the {{Destroyer}} - and that might be a bit confusing.

2.2) Introduce a new part of the workflow that will be executed after the 
{{Destroyer}}. Something like {{Updater}}.

*Pros:* We can easily run it on the Sqoop 2 server itself without moving/caring 
about where the {{Destroyer}} runs.
*Cons:* It seems weird to have a part of the workflow that is executed after the 
finalizing step, especially when it has the same semantics as the {{Destroyer}} 
(we will call it exactly once, on one node).
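For completeness, such an {{Updater}} could mirror the {{Destroyer}} shape (a 
sketch only; the class name comes from above, and the use of 
{{ImmutableContext}} as the context type is my assumption):

{code:java}
import org.apache.sqoop.common.ImmutableContext;

// Sketch of idea 2.2: a new workflow phase that the Sqoop 2 server runs
// exactly once, after the Destroyer, regardless of where the Destroyer
// itself executed.
public abstract class Updater<LinkConfiguration, JobConfiguration> {
  public abstract void update(ImmutableContext context,
                              LinkConfiguration linkConfiguration,
                              JobConfiguration jobConfiguration);
}
{code}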

I'm sure that there are other ways to expose this contract in the connector 
interface, so don't hesitate to jump in with other ideas!

> JobManager and Execution Engine changes: Support for injecting and pulling 
> out configs and job output in connectors 
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SQOOP-1803
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1803
>             Project: Sqoop
>          Issue Type: Sub-task
>            Reporter: Veena Basavaraj
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.6
>
>
> The details are in the design wiki; as the implementation happens, more 
> discussions can happen here.
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+And+Merge+Design#DeltaFetchAndMergeDesign-Howtogetoutputfromconnectortosqoop?
> The goal is to dynamically inject an IncrementalConfig instance into the 
> FromJobConfiguration. The current MFromConfig and MToConfig can already hold 
> a list of configs, and a strong sentiment was expressed to keep it as a list, 
> so why not actually make use of it for the first time and group the 
> incremental-related configs in one config object?
> This task will prepare the FromJobConfiguration from the job config data, and 
> the ExtractorContext with the relevant values from the previous job run.
> This task will prepare the ToJobConfiguration from the job config data, and 
> the LoaderContext with the relevant values from the previous job run, if any.
> We will use the DistributedCache to get state information out of the Extractor 
> and Loader, and finally persist it into the Sqoop repository (depending on 
> SQOOP-1804) once the OutputCommitter's commit is called.


