[ 
https://issues.apache.org/jira/browse/SQOOP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223886#comment-14223886
 ] 

Veena Basavaraj edited comment on SQOOP-1168 at 11/25/14 4:00 AM:
------------------------------------------------------------------

[~vinothchandar] Hey, spent a fair amount of time digging up Sqoop1 and 
understanding what it did for the writes. It was a much simpler feature set in 
Sqoop1 since it was SQL to Hadoop; in the case of Sqoop2 it can be anything to 
anything! I have created a bunch of tasks and will put up a design wiki on 
https://cwiki.apache.org/confluence/display/SQOOP/Home once I clean it up a bit.


> Sqoop2: Incremental From/To ( formerly called Incremental Import )
> ------------------------------------------------------------------
>
>                 Key: SQOOP-1168
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1168
>             Project: Sqoop
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.5
>
>
> The initial plan is to follow roughly the same design as Sqoop 1, except to 
> provide pluggability so that this can be started through a REST API.
> This comment applies to every sub-task:
> {code}
> WIP ( so do not consider this as a final design)
> {code}
> Relevant code in Sqoop 1 (to ensure parity with tests and features). Also 
> note that Sqoop 1 was less generic than Sqoop2 in terms of the from and to: 
> the FROM was SQL and the TO was HDFS/Hive. This is no longer true in Sqoop2; 
> it can be from HDFS to SQL or from MongoDB to MySQL.
> https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/tool/ImportTool.java
> https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/MergeJob.java
> https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/util/AppendUtils.java
> https://github.com/apache/sqoop/blob/trunk/src/test/com/cloudera/sqoop/TestAppendUtils.java
> https://github.com/apache/sqoop/blob/trunk/src/test/org/apache/sqoop/manager/oracle/OracleIncrementalImportTest.java
> https://github.com/apache/sqoop/blob/trunk/src/test/com/cloudera/sqoop/TestIncrementalImport.java
> Some things I considered:
> 1. Do we need the 2 modes: a simpler sequential/append case based on a simple 
> predicate (that can be the default), and a more generic random set of records 
> (based on a complex predicate)? The only reason these 2 modes might make 
> sense is that the simple append use case can, for the most part, be handled 
> by Sqoop itself; there is very little logic to add per connector.
> The more complex case, where we can have a random selection of records to 
> read and even write them out in many different ways, requires custom logic. 
> We can certainly provide some common implementation of doing such updates, 
> such as writing a new dataset for the delta records and then merging 
> duplicate rows across data sets based on the last updated value.
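> A minimal sketch of the two modes discussed above (the name IncrementalMode 
> is illustrative only, not an existing Sqoop2 type):
> {code}
> /**
>  * Hypothetical enum for the two incremental modes.
>  * APPEND covers the simple sequential case driven by a single predicate;
>  * RANDOM covers arbitrary delta selections that need a merge step.
>  */
> public enum IncrementalMode {
>   APPEND,  // new rows only, e.g. id >= <last value>
>   RANDOM   // arbitrary updated rows, merged on a key + last-updated column
> }
> {code}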
> 2. What do the phases of the incremental read/write process look like for 
> each connector? Should this be an independent stage in the job lifecycle 
> complementing the Extractor, another API on the Extractor to do a 
> deltaExtract, or just another config object on the FromJobConfiguration 
> holding the computed predicate string (a combination of the raw predicate 
> and the last value from the FromConfig of the job)?
> Seems like just a field on the FromJobConfiguration suffices. It can be used 
> in both the Partitioner and the Extractor.
> The FromJobConfiguration can have a dynamically created config object in its 
> list that is reserved for all the incremental-related fields. One of the 
> fields in it will be the following predicate.
> Say one way of supporting the predicate is via a proper schema like this:
> {code}
> {
>   "type": "append",
>   "columns": "id",
>   "operator": ">=",
>   "value": "34"
> }
> {code}
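> As a rough illustration (the class and field names below are hypothetical, 
> not existing Sqoop2 classes), the same predicate could be carried as a plain 
> config object on the FromJobConfiguration and read by the Partitioner and 
> Extractor:
> {code}
> // Hypothetical config object holding the incremental predicate fields.
> public class IncrementalReadConfig {
>   public String type;       // "append" for the simple sequential mode
>   public String column;     // the "columns" field in the schema, e.g. "id"
>   public String operator;   // comparison operator, e.g. ">="
>   public String lastValue;  // last processed value, e.g. "34"
> }
> {code}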
> The FromJobConfiguration might hold the actual query string WHERE id >= '34' 
> in string format. Alternatively, if we do not go with predicates and instead 
> use loose fields like in Sqoop1, then the FromJobConfiguration will hold 
> these fields and their values. The FromConfig will hold the last processed 
> value, so that it can be used in subsequent runs to indicate where to start.
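> A minimal sketch, assuming the hypothetical IncrementalReadConfig above, of 
> how a connector could derive that query fragment from the predicate fields 
> plus the last value instead of storing the raw string:
> {code}
> // Purely illustrative: builds e.g. "WHERE id >= '34'" from the config.
> // Quoting/escaping of the value would depend on the column type.
> public final class IncrementalPredicateBuilder {
>   public static String build(IncrementalReadConfig c) {
>     return String.format("WHERE %s %s '%s'", c.column, c.operator, c.lastValue);
>   }
> }
> {code}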
> 3. Similarly, in terms of writing/loading, the ToJobConfiguration will have 
> information in it to indicate whether this was a delta write. Based on the 
> predicate type/mode (SEQUENTIAL (APPEND) / RANDOM), the loader can decide 
> how to write the data.
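> A rough sketch of that decision, assuming the hypothetical IncrementalMode 
> enum from point 1 (method bodies here are connector-specific stubs):
> {code}
> // Illustrative only: the loader picks a write strategy from the mode
> // carried on the ToJobConfiguration.
> public class IncrementalLoaderSketch {
>   void load(IncrementalMode mode) {
>     if (mode == IncrementalMode.APPEND) {
>       appendDelta();         // write delta records next to the existing data
>     } else {
>       stageDeltaAndMerge();  // stage delta, then merge duplicates on the key
>     }
>   }
>   private void appendDelta() { /* connector-specific append */ }
>   private void stageDeltaAndMerge() { /* connector-specific merge, like Sqoop1's MergeJob */ }
> }
> {code}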
> 4. Does storing the last-processed value in the job submission make it 
> easier? In that case we would need 2 fields: the last_read_value for the 
> FROM side and the last_written_value for the TO side. I would consider the 
> submission, since these values are the result of a job run, but still store 
> the predicate in the job config (with the given last value).
> 5. Even the last step, very inaptly named Destroyer, might need this info 
> after a successful run to finally record the values in the submission/configs 
> and other related stats of the incremental reads/writes. So both the FROM 
> and the TO destroyers will need to do this part.
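> A minimal sketch of the state such a post-run step would record (names are 
> illustrative; the actual Sqoop2 Destroyer and submission APIs may differ):
> {code}
> // Hypothetical holder for the values recorded after a successful run:
> // the FROM side records the last value it read and the TO side the last
> // value it wrote, so the next run knows where to start.
> public class IncrementalRunState {
>   private String lastReadValue;     // FROM side, e.g. highest "id" extracted
>   private String lastWrittenValue;  // TO side, e.g. highest "id" loaded
>   public void recordFrom(String value) { this.lastReadValue = value; }
>   public void recordTo(String value) { this.lastWrittenValue = value; }
>   public String getLastReadValue() { return lastReadValue; }
>   public String getLastWrittenValue() { return lastWrittenValue; }
> }
> {code}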



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
