[
https://issues.apache.org/jira/browse/SQOOP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223886#comment-14223886
]
Veena Basavaraj edited comment on SQOOP-1168 at 11/25/14 4:00 AM:
------------------------------------------------------------------
[~vinothchandar] Hey, I spent a fair amount of time digging into Sqoop1 and
understanding what it did for the writes. It was a much simpler feature set in
Sqoop1 since it was SQL to Hadoop; in the case of Sqoop2 it can be anything to
anything! I have created a bunch of tasks and will put up a design wiki on
https://cwiki.apache.org/confluence/display/SQOOP/Home once I clean it up a bit.
> Sqoop2: Incremental From/To ( formerly called Incremental Import )
> ------------------------------------------------------------------
>
> Key: SQOOP-1168
> URL: https://issues.apache.org/jira/browse/SQOOP-1168
> Project: Sqoop
> Issue Type: Bug
> Reporter: Hari Shreedharan
> Assignee: Veena Basavaraj
> Fix For: 1.99.5
>
>
> The initial plan is to follow roughly the same design as Sqoop 1, except to provide
> pluggability to start this through a REST API.
> This comment applies to every sub-task:
> {code}
> WIP (do not consider this a final design)
> {code}
> Relevant code in Sqoop 1 (to ensure parity with its tests and features). Also
> note that Sqoop 1 was less generic than Sqoop2 in terms of the FROM and TO:
> the FROM was SQL and the TO was HDFS/Hive. This is no longer true in
> Sqoop2; it can be from HDFS to SQL or from MongoDB to MySQL.
> https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/tool/ImportTool.java
> https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/MergeJob.java
> https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/util/AppendUtils.java
> https://github.com/apache/sqoop/blob/trunk/src/test/com/cloudera/sqoop/TestAppendUtils.java
> https://github.com/apache/sqoop/blob/trunk/src/test/org/apache/sqoop/manager/oracle/OracleIncrementalImportTest.java
> https://github.com/apache/sqoop/blob/trunk/src/test/com/cloudera/sqoop/TestIncrementalImport.java
> Some things I considered:
> 1. Do we need the 2 modes: a simpler sequential/append case based on a simple
> predicate (that can be the default), and a more generic random set of records
> based on a complex predicate? The only reason these 2 modes might make
> sense is that the simple append use case can, for the most part, be
> handled by Sqoop itself; there is very little logic to add per connector.
> The more complex case, where we can have a random selection of records to read
> and even write them out in many different ways, requires custom logic. We
> can certainly provide some common implementations of such updates, such
> as writing a new dataset for the delta records and then merging the duplicate
> row data sets based on the last-updated value.
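> As a rough illustration of these two modes, a minimal Java sketch follows; the
> enum name and comments are hypothetical and not part of any existing Sqoop2 API.
> {code}
> /**
>  * Hypothetical representation of the two incremental modes discussed above.
>  */
> public enum IncrementalMode {
>   // Simple sequential/append case driven by a single-column predicate;
>   // handled mostly by the Sqoop2 framework itself.
>   APPEND,
>   // Random selection of delta records based on a complex predicate;
>   // requires connector-specific logic (e.g. write the delta, then merge).
>   RANDOM
> }
> {code}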
> 2. What do the phases of the incremental read/write process look like for each
> connector? Should this be an independent stage in the job lifecycle
> complementing the Extractor, another API on the Extractor to do a
> deltaExtract, or just another config object on the FromJobConfiguration that holds
> the computed predicate string (a combination of the raw predicate and the last
> value from the FromConfig of the job)?
> It seems like just a field on the FromJobConfiguration suffices. It can be used
> both in the Partitioner and the Extractor.
> The FromJobConfiguration can have a dynamically created config object in its
> list that is reserved for all the incremental-related fields. One of the
> fields in it will be the following predicate.
> Say one way of supporting the predicate is via a proper schema like this:
> {code}
> {
>   "type" : "append",
>   "columns" : "id",
>   "operator" : ">=",
>   "value" : "34"
> }
> {code}
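> As a rough sketch only, the schema above could map onto a plain config object
> reserved for the incremental fields on the FromJobConfiguration; the class and
> field names below are hypothetical, not final API.
> {code}
> /**
>  * Hypothetical config object holding the incremental fields sketched above.
>  */
> public class IncrementalReadConfig {
>   public String type;      // "append" or "random"
>   public String columns;   // column the predicate applies to, e.g. "id"
>   public String operator;  // comparison operator, e.g. ">="
>   public String value;     // last value from the previous run, e.g. "34"
>
>   // Computed predicate string usable by the Partitioner and Extractor,
>   // e.g. "id >= '34'"
>   public String toPredicate() {
>     return columns + " " + operator + " '" + value + "'";
>   }
> }
> {code}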
> The FromJobConfiguration might have the actual query string WHERE id >=
> '34' in string format. Alternatively, if we do not go with the predicates
> and instead use loose fields as in Sqoop1, then the FromJobConfiguration will hold
> these fields and their values. The FromConfig will hold the last processed
> value, so that it can be used in subsequent runs to indicate from where
> to start.
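> A minimal sketch of how the raw predicate and the last processed value could be
> combined into the final query fragment; the helper below is purely illustrative.
> {code}
> /** Hypothetical helper; not part of the Sqoop2 code base. */
> public final class PredicateBuilder {
>   private PredicateBuilder() {}
>
>   public static String combine(String rawPredicate, String column,
>                                 String operator, String lastValue) {
>     String incremental = column + " " + operator + " '" + lastValue + "'";
>     if (rawPredicate == null || rawPredicate.isEmpty()) {
>       return incremental;
>     }
>     // Both the user's own filter and the delta bound must hold.
>     return "(" + rawPredicate + ") AND " + incremental;
>   }
> }
> {code}
> For example, combine("region = 'US'", "id", ">=", "34") would yield
> "(region = 'US') AND id >= '34'".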
> 3. Similarly, in terms of writing/loading, the ToJobConfiguration will have
> information in it to indicate whether this was a delta write. Based on the
> predicate type/mode (SEQUENTIAL (APPEND) / RANDOM), the loader can
> decide how to write the data.
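> A sketch of how a loader could branch on the mode carried by the
> ToJobConfiguration, reusing the hypothetical IncrementalMode enum from point 1;
> the method names are illustrative and not part of the connector SDK.
> {code}
> /** Hypothetical delta-aware loader skeleton. */
> public class DeltaAwareLoader {
>   public void load(IncrementalMode mode, Iterable<String> records) {
>     switch (mode) {
>       case APPEND:
>         // Sequential case: simply append the delta records to the target.
>         appendRecords(records);
>         break;
>       case RANDOM:
>         // Complex case: write the delta as a new dataset, then merge
>         // duplicates against the existing data using the last-updated value.
>         writeDeltaDataset(records);
>         mergeOnLastUpdated();
>         break;
>     }
>   }
>
>   private void appendRecords(Iterable<String> records) { /* connector specific */ }
>   private void writeDeltaDataset(Iterable<String> records) { /* connector specific */ }
>   private void mergeOnLastUpdated() { /* connector specific */ }
> }
> {code}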
> 4. Does storing the last-processed value in the job submission make it
> easier? In that case we would need 2 fields: the last_read_value for the FROM
> side and the last_written_value for the TO. I would consider the submission,
> since these values are a result of the job run, but still store the predicate in the
> job config (with the given last value).
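> If the submission route is taken, the extra bookkeeping could look roughly like
> this; the field names are hypothetical.
> {code}
> /** Hypothetical per-run bookkeeping hanging off the job submission. */
> public class IncrementalSubmissionInfo {
>   public String lastReadValue;     // highest value read on the FROM side
>   public String lastWrittenValue;  // highest value written on the TO side
> }
> {code}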
> 5. Even the last step, rather inaptly named Destroyer, might need this info
> after a successful run to finally record the values in the submission/configs, along with
> other related stats of the incremental reads/writes. So both the FROM and the TO
> Destroyer will need to do this part.
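> A sketch of the post-run recording step described here, reusing the hypothetical
> IncrementalSubmissionInfo from point 4; this is not the actual Destroyer API.
> {code}
> /** Hypothetical post-run hook in the spirit of the Destroyer step. */
> public class IncrementalBookkeeping {
>   public void onSuccess(IncrementalSubmissionInfo info,
>                         String maxReadValue, String maxWrittenValue) {
>     // Record the boundary values so the next run knows where to resume.
>     info.lastReadValue = maxReadValue;
>     info.lastWrittenValue = maxWrittenValue;
>     // Other stats of the incremental read/write could be recorded here as well.
>   }
> }
> {code}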
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)