[
https://issues.apache.org/jira/browse/SQOOP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248693#comment-14248693
]
Gwen Shapira commented on SQOOP-1168:
-------------------------------------
Thanks for putting the design together. I reviewed and overall it looks great.
I do have some questions and comments:
* Terminology:
Can we add "predicate" to the terminology? "encapsulates the information for
reading and writing subset or all records from the data source" is not a
standard definition of a predicate and was a bit confusing to me.
* User facing:
I think the decision whether delta will be used should be in the job
definition, not made when starting the job (i.e. I agree with
[~vinothchandar]). First, because it matches Sqoop1 behavior, and second
because running both incremental and non-incremental from the same job will
make it more difficult to store and retrieve the “what did I fetch last time”
part of the information. It’s not even 100% clear what the correct behavior
in this case would be.
* Connector API:
Connectors seem to need to give a lot of information about their capabilities,
and this may keep growing. In addition to supporting incrementals, TO
connectors will need to say whether they support updates or just appends.
getSupportedDirections is nice, but perhaps instead of adding more such methods
to API, we can use annotations and later introspect them to see what the
connector supports? Another option is to add interfaces: MongoDBConnector
implements SqoopConnector extends incremental… but this can get pretty large
too.
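To make the annotation idea concrete, here is a minimal sketch of what
introspection could look like. The annotation and class names
(SupportsIncremental, SupportsUpdates, CapabilityCheck) are purely
illustrative assumptions, not part of any existing Sqoop2 API:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical capability annotations; the framework would define these once
// and connectors would opt in by annotating their connector class.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SupportsIncremental {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SupportsUpdates {}

// A connector declares what it can do instead of overriding yet another method.
@SupportsIncremental
class MongoDBConnector { /* ... */ }

class CapabilityCheck {
    // The framework introspects the connector class at load time,
    // so adding a new capability never changes the SqoopConnector API.
    static boolean supportsIncremental(Class<?> connector) {
        return connector.isAnnotationPresent(SupportsIncremental.class);
    }

    static boolean supportsUpdates(Class<?> connector) {
        return connector.isAnnotationPresent(SupportsUpdates.class);
    }
}
```

The upside over marker interfaces is that new capabilities don't multiply the
type hierarchy; the downside is that capabilities become invisible to the
compiler.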
* Predicate API:
Your description covers the information on how connectors will use the
predicate, but I think the framework may need it to store and provide “what did
I fetch last time” values.
In general, I’d prefer a simpler and perhaps less generic API here. A way for
connectors to store “last fetched value” in the framework (date for hdfs,
offsets for kafka, column values for jdbc) and a way to use it.
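As a sketch of what I mean by “simpler and less generic”: the framework could
expose something like the interface below, keyed by job, with the value being
an opaque string the connector interprets. All names here (DeltaStateStore,
saveLastFetched, etc.) are hypothetical, just to illustrate the shape:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical framework-side API for persisting a connector's
// "last fetched value" between job executions.
interface DeltaStateStore {
    // value is opaque to the framework: a date for HDFS, an offset for
    // Kafka, a column value for JDBC.
    void saveLastFetched(long jobId, String value);

    // Empty on the first run of the job.
    Optional<String> lastFetched(long jobId);
}

// In-memory stand-in for illustration; the real implementation would
// persist to the Sqoop repository.
class InMemoryDeltaStateStore implements DeltaStateStore {
    private final Map<Long, String> state = new HashMap<>();

    public void saveLastFetched(long jobId, String value) {
        state.put(jobId, value);
    }

    public Optional<String> lastFetched(long jobId) {
        return Optional.ofNullable(state.get(jobId));
    }
}
```

A JDBC connector would then build its predicate from the stored value (e.g.
`WHERE id > lastFetched`), and a Kafka connector would seek to the stored
offset, without the framework needing to understand either.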
* Repository: Job inputs can work, but I think submission makes more sense
because it is built to store history from previous job executions.
* REST API - yep, definitely.
Few comments on open questions:
* Some connectors only have the concepts of append/incremental - Kafka can’t
overwrite existing values in a topic. HDFS makes updates challenging and
knowing which records were updated near impossible. I think exposing the
difference to users will be less surprising.
* JSON in the command line will be ugly… I hope for a better way to do things.
* +1 for SQ_JOB_SUBMISSION, I think submission is generic enough to not require
schema changes.
> Sqoop2: Delta Fetch/ Merge ( formerly called Incremental Import )
> -----------------------------------------------------------------
>
> Key: SQOOP-1168
> URL: https://issues.apache.org/jira/browse/SQOOP-1168
> Project: Sqoop
> Issue Type: Bug
> Reporter: Hari Shreedharan
> Assignee: Veena Basavaraj
> Fix For: 1.99.5
>
>
> The formal design wiki is here
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+and+Merge+Design
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)