[
https://issues.apache.org/jira/browse/SQOOP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248693#comment-14248693
]
Gwen Shapira commented on SQOOP-1168:
-------------------------------------
Thanks for putting the design together. I reviewed and overall it looks great.
I do have some questions and comments:
* Terminology:
Can we add "predicate" to the terminology? "encapsulates the information for
reading and writing subset or all records from the data source" is not a
standard definition of a predicate and was a bit confusing to me.
* User facing:
I think the decision whether delta will be used should be in the job
definition, not made when starting the job (i.e. I agree with
[~vinothchandar]). First, because it matches Sqoop1 behavior, and second
because running both incremental and non-incremental from the same job will
make it more difficult to store and retrieve the “what did I fetch last time”
part of the information. It’s not even 100% clear what the correct behavior
in this case would be.
* Connector API:
Connectors seem to need to give a lot of information about their capabilities,
and this may keep growing. In addition to supporting incrementals, TO
connectors will need to say whether they support updates or just appends.
getSupportedDirections is nice, but perhaps instead of adding more such methods
to API, we can use annotations and later introspect them to see what the
connector supports? Another option is to add interfaces: MongoDBConnector
implements SqoopConnector extends incremental… but this can get pretty large
too.
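To make the annotation idea concrete, here is a minimal sketch of what
introspection could look like. The annotation and class names
(SupportsIncremental, SupportsUpdates, CapabilityCheck) are purely
illustrative assumptions, not part of any existing Sqoop2 API:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical capability annotations; the framework would define these once
// and connectors would opt in by annotating their connector class.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SupportsIncremental {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SupportsUpdates {}

// A connector declares what it can do instead of overriding yet another method.
@SupportsIncremental
class MongoDBConnector { /* ... */ }

class CapabilityCheck {
    // The framework introspects the connector class at load time,
    // so adding a new capability never changes the SqoopConnector API.
    static boolean supportsIncremental(Class<?> connector) {
        return connector.isAnnotationPresent(SupportsIncremental.class);
    }

    static boolean supportsUpdates(Class<?> connector) {
        return connector.isAnnotationPresent(SupportsUpdates.class);
    }
}
```

The upside over marker interfaces is that new capabilities don't multiply the
type hierarchy; the downside is that capabilities become invisible to the
compiler.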
* Predicate API:
Your description covers the information on how connectors will use the
predicate, but I think the framework may need it to store and provide “what did
I fetch last time” values.
In general, I’d prefer a simpler and perhaps less generic API here. A way for
connectors to store “last fetched value” in the framework (date for hdfs,
offsets for kafka, column values for jdbc) and a way to use it.
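As a sketch of what I mean by “simpler and less generic”: the framework could
expose something like the interface below, keyed by job, with the value being
an opaque string the connector interprets. All names here (DeltaStateStore,
saveLastFetched, etc.) are hypothetical, just to illustrate the shape:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical framework-side API for persisting a connector's
// "last fetched value" between job executions.
interface DeltaStateStore {
    // value is opaque to the framework: a date for HDFS, an offset for
    // Kafka, a column value for JDBC.
    void saveLastFetched(long jobId, String value);

    // Empty on the first run of the job.
    Optional<String> lastFetched(long jobId);
}

// In-memory stand-in for illustration; the real implementation would
// persist to the Sqoop repository.
class InMemoryDeltaStateStore implements DeltaStateStore {
    private final Map<Long, String> state = new HashMap<>();

    public void saveLastFetched(long jobId, String value) {
        state.put(jobId, value);
    }

    public Optional<String> lastFetched(long jobId) {
        return Optional.ofNullable(state.get(jobId));
    }
}
```

A JDBC connector would then build its predicate from the stored value (e.g.
`WHERE id > lastFetched`), and a Kafka connector would seek to the stored
offset, without the framework needing to understand either.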
* Repository: Job inputs can work, but I think submission makes more sense
because it is built to store history from previous job executions.
* REST API - yep, definitely.
Few comments on open questions:
* Some connectors only have the concepts of append/incremental - Kafka can’t
overwrite existing values in a topic. HDFS makes updates challenging and
knowing which records were updated near impossible. I think exposing the
difference to users will be less surprising.
* JSON in the command line will be ugly… I hope for a better way to do things.
* +1 for SQ_JOB_SUBMISSION, I think submission is generic enough to not require
schema changes.
> Sqoop2: Delta Fetch/ Merge ( formerly called Incremental Import )
> -----------------------------------------------------------------
>
> Key: SQOOP-1168
> URL: https://issues.apache.org/jira/browse/SQOOP-1168
> Project: Sqoop
> Issue Type: Bug
> Reporter: Hari Shreedharan
> Assignee: Veena Basavaraj
> Fix For: 1.99.5
>
>
> The formal design wiki is here
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+and+Merge+Design
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)