[ https://issues.apache.org/jira/browse/SQOOP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250180#comment-14250180 ]

Veena Basavaraj commented on SQOOP-1168:
----------------------------------------

[~gwenshap] Will try to answer each of your points here

>>>>Terminology:
Can we add "predicate" to the terminology? "encapsulates the information for 
reading and writing subset or all records from the data source" is not a 
standard definition of a predicate and was a bit confusing to me.

As answered above, we will just stick to "Config"; an example of how it looks 
was already given above.
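
To make it concrete in this thread as well, here is a rough sketch of what such 
a config could look like, assuming the @ConfigClass/@Input annotation style that 
existing Sqoop2 connector configs use; the class and field names 
(IncrementalReadConfig, checkColumn, lastValue) are hypothetical, not part of 
the proposal:

{code:java}
// Hypothetical connector-supplied config for delta fetch.
// Class and field names are illustrative only.
import org.apache.sqoop.model.ConfigClass;
import org.apache.sqoop.model.Input;

@ConfigClass
public class IncrementalReadConfig {

  // Column whose values grow monotonically (e.g. an id or a timestamp).
  @Input(size = 255)
  public String checkColumn;

  // Value recorded from the previous run; only records with checkColumn
  // greater than this would be fetched on the next run.
  @Input(size = 255)
  public String lastValue;
}
{code}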


>>>>>User facing:
I think the decision whether delta will be used should be in the job 
definition, not when starting the job (i.e. I agree with Vinoth Chandar). 
First, because it matches Sqoop1 behavior, and second because running both 
incremental and non-incremental from the same job will make it more difficult 
to store and retrieve the “what did I fetch last time” part of the information. 
It's not even 100% clear what the correct behavior would be in this case.

Yes, it is on create job. Since my last comment to [~vinothchandar] I agree 
that it makes sense, so we don't need a separate "create incremental" or 
"create delta" job or anything like that. If a connector exposes a 
corresponding config, we will present it to the user; if the user fills in 
those values, that info will be used in the job run, and if those values exist 
the connector will have code to read FROM and write TO its data source in the 
appropriate way.

I was not sure what matches Sqoop1 though. In this design, the user is 
completely agnostic of what type of job it is; if they fill in the correct 
config values, it will work accordingly.

So let's take a case.
1. Create job -f 1 -t 2: the user is presented with the configs for delta 
fetch and fills in the values, and gets a job id back. The user then says 
start job -j 1: the connector code will provide validators, and once the 
values are legit, it will use them to do the reading. So in this case it may 
read only records with id > 24, or records with id between 23 and 80. It 
depends.

2. The next time the user says start job -j 1, we will use the same config 
values that were given, but for the value that should actually be used we will 
use the one recorded in the history. It is again a question of how we annotate 
these config inputs to tell the user what is happening (see the sketch below).
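
To make that flow concrete, here is a very rough sketch of how a JDBC-style 
connector could use the stored last value when building its read query. None 
of these class or method names are real Sqoop2 code; they are purely 
illustrative:

{code:java}
// Illustrative pseudo-connector logic; DeltaQueryBuilder, buildDeltaQuery
// and lastValueFromHistory are made-up names.
public class DeltaQueryBuilder {

  /**
   * Builds the query for a delta fetch. On the first run only the
   * user-supplied config value is available; on later runs the framework
   * would pass in the value recorded from the previous submission.
   */
  public String buildDeltaQuery(String table, String checkColumn,
                                String lastValueFromHistory) {
    StringBuilder sql = new StringBuilder("SELECT * FROM " + table);
    if (checkColumn != null && lastValueFromHistory != null) {
      // e.g. "id > 24" -- only records newer than the previous run
      sql.append(" WHERE ").append(checkColumn)
         .append(" > ").append(lastValueFromHistory);
    }
    return sql.toString();
  }
}
{code}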



>>>>Connector API:
Connectors seem to need to give a lot of information about capabilities, and 
this may keep growing. In addition to supporting incrementals, TO connectors 
will need to say whether they support updates or just appends. 
getSupportedDirections is nice, but perhaps instead of adding more such methods 
to API, we can use annotations and later introspect them to see what the 
connector supports? Another option is to add interfaces: MongoDBConnector 
implements SqoopConnector extends incremental… but this can get pretty large 
too.

Re getSupportedDirections: I think you meant getSupportedDeltaConfigs. At this 
point it is just new config classes. I would move away from the annotation 
approach in the blink of an eye; I prefer a base class and then extending it 
for such things.
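
Purely as a sketch of the base-class idea (IncrementalSqoopConnector and 
getSupportedDeltaConfigs are names I am making up here, not committed API), 
something along these lines:

{code:java}
// Hypothetical base class; a connector that supports delta fetch would
// extend this instead of announcing the capability via another annotation.
import org.apache.sqoop.common.Direction;
import org.apache.sqoop.connector.spi.SqoopConnector;

public abstract class IncrementalSqoopConnector extends SqoopConnector {

  // Returns the extra config class the connector exposes for delta fetch,
  // in the spirit of the existing getJobConfigurationClass(Direction).
  public abstract Class<?> getSupportedDeltaConfigs(Direction direction);
}
{code}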


>>>>>Predicate API:
Your description covers the information on how connectors will use the 
predicate, but I think the framework may need it to store and provide “what did 
I fetch last time” values.
In general, I’d prefer a simpler and perhaps less generic API here. A way for 
connectors to store “last fetched value” in the framework (date for hdfs, 
offsets for kafka, column values for jdbc) and a way to use it.
Repository: Job inputs can work, but I think submission makes more sense 
because it is built to store history from previous job executions

The counters seem like a good place; let me explore more. I am completely open 
to it, as long as this table stays generic and we can expose an API to easily 
retrieve these values per job.
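
As a rough sketch of the kind of generic store/retrieve this would enable 
(DeltaStateStore and its methods are hypothetical names, not an existing 
Sqoop2 API; the real storage would of course live in the repository, not in 
memory):

{code:java}
// Hypothetical framework-side helper illustrating the "store and retrieve
// the last fetched value per job" idea.
import java.util.HashMap;
import java.util.Map;

public class DeltaStateStore {

  // jobId -> last fetched value recorded by the previous successful run
  // (date for HDFS, offsets for Kafka, column value for JDBC).
  private final Map<Long, String> lastFetchedValues = new HashMap<>();

  // Called by the framework after a successful submission.
  public void recordLastValue(long jobId, String value) {
    lastFetchedValues.put(jobId, value);
  }

  // Called by the connector (via the framework) at the start of the next run.
  public String getLastValue(long jobId) {
    return lastFetchedValues.get(jobId);
  }
}
{code}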

>>>REST API - yep, definitely.
Sure.

A few comments on the open questions:
>>>Some connectors only have the concepts of append/incremental - Kafka can’t 
>>>overwrite existing values in a topic. HDFS makes updates challenging and 
>>>knowing which records were updated near impossible. I think exposing the 
>>>difference to users will be less surprising.

As discussed above, every connector can call it whatever it wants :) and Kafka 
can call it foo bar as long as it can explain in the help text what foo bar 
is. HDFS can say that all it supports is appending records, not overwrites.


>>>JSON in the command line will be ugly… I hope for a better way to do things.

This is moot now, since we will use key-value pairs like the other configs.

>>>+1 for SQ_JOB_SUBMISSION, I think submission is generic enough to not 
>>>require schema changes.
Already answered above.

> Sqoop2: Delta Fetch/ Merge ( formerly called Incremental Import )
> -----------------------------------------------------------------
>
>                 Key: SQOOP-1168
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1168
>             Project: Sqoop
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.5
>
>
> The formal design wiki is here 
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+and+Merge+Design


