[ https://issues.apache.org/jira/browse/SQOOP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Veena Basavaraj updated SQOOP-1168:
-----------------------------------
    Description: 
The formal design wiki is here:
https://cwiki.apache.org/confluence/display/SQOOP/Incremental+From+To+Design



Following are some notes from investigating the design.

The initial plan is to follow roughly the same design as Sqoop 1, except to 
provide pluggability and to allow starting this through a REST API.

Relevant code in Sqoop 1 (to ensure parity with tests and features) is linked 
below. Also note that Sqoop 1 was less generic than Sqoop 2 in terms of the 
FROM and TO: the FROM was a SQL database and the TO was HDFS/Hive. This is no 
longer true in Sqoop 2; it can be from HDFS to SQL, or from MongoDB to MySQL.


https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/tool/ImportTool.java
https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/MergeJob.java

https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/util/AppendUtils.java
https://github.com/apache/sqoop/blob/trunk/src/test/com/cloudera/sqoop/TestAppendUtils.java

https://github.com/apache/sqoop/blob/trunk/src/test/org/apache/sqoop/manager/oracle/OracleIncrementalImportTest.java
https://github.com/apache/sqoop/blob/trunk/src/test/com/cloudera/sqoop/TestIncrementalImport.java

Some things I considered:

1. Do we need two modes: a simpler sequential/append case based on a simple 
predicate (which can be the default), and a more generic random-set-of-records 
case based on a complex predicate? The only reason these two modes might make 
sense is that the simple append use case can, for the most part, be handled by 
Sqoop itself; there is very little logic to add per connector.

The more complex case, where we can have a random selection of records to read 
and even write them out in many different ways, requires custom logic. We can 
certainly provide some common implementation for doing such updates, such as 
writing a new dataset for the delta records and then merging duplicate rows 
based on the last-updated value (a minimal sketch of that merge follows).
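
As a rough illustration of the "write delta, then merge" idea above, here is a 
minimal Java sketch. Row and DeltaMerge are made-up names for this note, not 
existing Sqoop classes; the real logic would live in a merge job similar to 
Sqoop 1's MergeJob.

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative row shape: a key plus a last-updated marker.
class Row {
  long id;
  long lastUpdated;
  Object[] fields;
}

class DeltaMerge {
  // Keep whichever copy of a row has the newer last-updated value,
  // so delta records overwrite stale rows and new rows are appended.
  static Map<Long, Row> merge(Map<Long, Row> existing, Map<Long, Row> delta) {
    Map<Long, Row> merged = new HashMap<>(existing);
    for (Row r : delta.values()) {
      Row old = merged.get(r.id);
      if (old == null || r.lastUpdated >= old.lastUpdated) {
        merged.put(r.id, r);
      }
    }
    return merged;
  }
}
{code}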

2. What do the phases of the incremental read/write process look like for each 
connector? Should this be an independent stage in the job lifecycle 
complementing the Extractor, another API on the Extractor to do a deltaExtract, 
or just another config object on the FromJobConfiguration that holds the 
computed predicate string (a combination of the raw predicate and the last 
value from the FromConfig of the job)?
It seems like just a field on the FromJobConfiguration suffices. It can be used 
in both the Partitioner and the Extractor.

The FromJobConfiguration can have a dynamically created config object in its 
list that is reserved for all the incremental-related fields. One of the fields 
in it will be the following predicate.

Say one way of supporting the predicate is via a proper schema like this:
{code}
{
  "type"     : "append",
  "columns"  : "id",
  "operator" : ">=",
  "value"    : "34"
}
{code}

The FromJobConfiguration might hold the actual query string, WHERE id >= '34', 
in string form. Alternatively, if we do not go with predicates and instead use 
loose fields like in Sqoop 1, then the FromJobConfiguration will hold those 
fields and their values. The FromConfig will hold the last processed value, so 
that it can be used in subsequent runs to indicate where to start.
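
To make that concrete, a minimal sketch of turning the schema above plus the 
last value into the WHERE fragment the Partitioner/Extractor could use. 
IncrementalPredicate and toWhereClause are hypothetical names for this note, 
not existing Sqoop2 API.

{code}
// Hypothetical holder for the predicate fields shown in the schema above.
class IncrementalPredicate {
  String columns;   // e.g. "id"
  String operator;  // e.g. ">="
  String value;     // last processed value, e.g. "34"

  // Renders e.g. "id >= '34'"; real code would quote/escape per column type.
  String toWhereClause() {
    return columns + " " + operator + " '" + value + "'";
  }
}
{code}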

3. Similarly, in terms of writing/loading, the ToJobConfiguration will have 
information in it to indicate whether this is a delta write. Based on the 
predicate type/mode (SEQUENTIAL (APPEND) / RANDOM), the loader can decide how 
to write the data, as sketched below.
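
A rough sketch of how a connector's loader might branch on that mode; the enum 
and method names here are illustrative only, not the actual Sqoop2 Loader API.

{code}
// Hypothetical modes matching the SEQUENTIAL(APPEND) / RANDOM discussion.
enum IncrementalMode { SEQUENTIAL_APPEND, RANDOM }

class IncrementalLoaderSketch {
  void load(IncrementalMode mode) {
    switch (mode) {
      case SEQUENTIAL_APPEND:
        // simply append the delta records to the existing data set
        appendDelta();
        break;
      case RANDOM:
        // write the delta as a new data set, then merge duplicate rows
        // based on the last-updated value (see the merge sketch in item 1)
        writeDeltaDataset();
        mergeOnLastUpdatedValue();
        break;
    }
  }

  private void appendDelta() { /* placeholder */ }
  private void writeDeltaDataset() { /* placeholder */ }
  private void mergeOnLastUpdatedValue() { /* placeholder */ }
}
{code}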

4. Does storing the last-processed value in the job submission make it easier? 
In that case we would need two fields: last_read_value for the FROM side and 
last_written_value for the TO side. I would lean towards the submission, since 
these values are a result of a job run, but still store the predicate in the 
job config (with the given last value).
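
If we go the submission route, the per-run state could be as small as this; 
the field names follow the text above and this is purely a sketch, not an 
existing Sqoop2 class.

{code}
// Hypothetical per-run incremental state kept on the job submission.
class IncrementalSubmissionState {
  String lastReadValue;     // highest value the FROM side read in this run
  String lastWrittenValue;  // highest value the TO side wrote in this run
}
{code}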

5. Even the last step, rather inaptly named Destroyer, might need this info 
after a successful run to finally record the values in the submission/configs, 
along with other related stats of the incremental reads/writes. So both the 
FROM and the TO destroyer will need to do this part.
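
A minimal sketch of what that would mean for the destroyers; the class and the 
save call are placeholders, not the real Sqoop2 Destroyer signature.

{code}
// Hypothetical FROM-side destroyer step: on success, persist the new
// high-water mark so the next run can resume from it.
class IncrementalFromDestroyerSketch {
  void onJobFinished(boolean success, String lastReadValue) {
    if (!success) {
      return;  // keep the previously stored value if the run failed
    }
    saveToSubmission("incremental.lastReadValue", lastReadValue);
  }

  private void saveToSubmission(String key, String value) {
    // placeholder for writing back to the submission / job config repository
    System.out.println(key + " = " + value);
  }
}
{code}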


        Summary: Sqoop2: Incremental and Delta updates ( formerly called 
Incremental Import )  (was: Sqoop2: Incremental From/To ( formerly called 
Incremental Import ))

> Sqoop2: Incremental and Delta updates ( formerly called Incremental Import )
> ----------------------------------------------------------------------------
>
>                 Key: SQOOP-1168
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1168
>             Project: Sqoop
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.5
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
