[jira] [Commented] (SQOOP-1168) Sqoop2: Delta Fetch/ Merge ( formerly called Incremental Import )

Jarek Jarcec Cecho (JIRA) Sun, 11 Jan 2015 08:48:06 -0800

    [ 
https://issues.apache.org/jira/browse/SQOOP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272969#comment-14272969
 ]


Jarek Jarcec Cecho commented on SQOOP-1168:
-------------------------------------------

Thank you for the nice summary [~vybs]!

I do have additional notes for two of the open questions that will hopefully 
help show how my thoughts are flowing:

For *Question 2*:

Thank you for capturing the two ways in which the user can explicitly select 
whether to incrementally transfer "only new records" or "new and updated 
records". Let me repeat the two options here to make it easier for the reader 
to know what I'm referring to. The first option is to explicitly expose some 
sort of "job type" in the REST interface and (T|G)UI, e.g. {{create job 
--from 1 --to 2 --type INCREMENTAL_NEW_ROWS_ONLY}} (wording is subject to 
change). The second is to require the connector to expose certain 
configuration options (switches) like "Enum Incremental incremental" with 
values {{NO}}, {{NEW_ROWS_ONLY}}, {{UPDATES_ALLOWED}} (keywords are just 
examples).

The configuration classes have been designed, and are currently written, so 
that they are opaque to Sqoop. For example, Sqoop currently can't force the 
connector to expose certain configuration options, because the definition of 
the classes lives entirely in the connector space. We did this intentionally 
to separate the values required by Sqoop from those required by the connector. 
I'm a bit concerned about changing this idiom in the proposed second option by 
prescribing some of the required values. I can immediately think of two places 
where this would get tricky:

1) _Upgrades._ Right now the configuration object is fully owned and defined by 
the connector, and so is the {{Upgrader}} that is responsible for upgrading 
objects created with an older version to the new version. If we add a 
requirement for pre-defined fields, the upgrade path will become much more 
complicated. The connector will then be responsible for moving values that it 
does not own, and in corner cases it might even be responsible for upgrading 
values that it is not aware of. For example, if we prescribe two fields in 
Sqoop version A and write a connector for version A, I would like this 
connector to work with all Sqoop versions A+ (B, C, D, ...). But it could 
happen that in Sqoop version B we add a third prescribed value, and in that 
case the connector would be responsible for upgrading a value that it does not 
understand at all, because the value didn't exist when the connector was 
written.

For values that are specific to Sqoop and not to the connector, we've 
introduced the "framework" (now called "driver") portion of a job. Perhaps we 
can re-use this facility to configure the incremental job if we don't want to 
go with option 1?

2) _Relevant values._ This to some extent restates part of my previous example 
as a standalone point. Right now the connectors expose what they support and 
are expected to cover those areas. It should be very simple for us to add new 
functionality to Sqoop without affecting existing connectors that don't 
support it, as those connectors simply did not explicitly sign up to support 
it. By putting prescribed values into configuration objects we might force 
connectors to deal with situations that they do not expect. One concrete 
example that came to my mind: we prescribe {{Enum Incremental}} as mentioned 
above with values {{NO}}, {{NEW_ROWS_ONLY}}, {{UPDATES_ALLOWED}} in Sqoop 
version A and write a connector for this version. In Sqoop version B, we add a 
new value {{DELETES_ALLOWED}} to the {{Enum}}. Now the configuration object 
contains an option that the connector doesn't support (or even know about), 
which might lead to a lot of trouble (starting with values being ignored and 
ending with various unexpected exceptions). Of course this example could be 
solved, but I'm trying to point out that we might open ourselves up to quite a 
lot of trouble in the future.
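
To make the two points above a bit more concrete, here is a minimal sketch of 
both failure modes. Everything in it is hypothetical and written only for 
illustration: the class {{PrescribedFieldsSketch}}, the keys {{tableName}}, 
{{incremental}} and {{mergeKey}}, and the map-based configuration are my 
assumptions and are not taken from the Sqoop code base.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types or keys exist in Sqoop.
class PrescribedFieldsSketch {

    // The prescribed enum as seen by a connector compiled against "version A".
    enum IncrementalA { NO, NEW_ROWS_ONLY, UPDATES_ALLOWED }

    // Upgrader written for version A: it copies only the keys it knows about,
    // so any field prescribed later (version B) is silently dropped.
    static Map<String, String> upgrade(Map<String, String> stored) {
        Map<String, String> upgraded = new HashMap<>();
        for (String key : new String[] {"tableName", "incremental"}) {
            if (stored.containsKey(key)) {
                upgraded.put(key, stored.get(key));
            }
        }
        return upgraded;
    }

    // Connector logic written for version A: it parses the stored value with
    // the only enum it knows. A value added in version B makes valueOf throw
    // IllegalArgumentException.
    static String plan(String storedIncremental) {
        switch (IncrementalA.valueOf(storedIncremental)) {
            case NO:
                return "full transfer";
            case NEW_ROWS_ONLY:
                return "append new rows";
            case UPDATES_ALLOWED:
                return "merge new and updated rows";
            default:
                throw new IllegalStateException("unreachable for version A values");
        }
    }
}
```

In this sketch a stored {{mergeKey}} (standing in for a field prescribed in 
version B) would be lost during upgrade, and {{plan("DELETES_ALLOWED")}} 
would fail with an exception the connector author never anticipated.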

For *Question 3*:

I wanted to explain a bit where my thoughts started when I suggested storing 
the last state information in the original configuration object. When creating 
the job, the user has to enter all the information about the state from which 
they want to start the incremental import. Running the incremental job leaves 
it in a different state. A different user might want to start from this second 
(or any subsequent) state. Hence I came to the conclusion that all the state 
information the connector needs must be present in the configuration object; 
otherwise there would be states that the user can never specify (start from). 
As [~vybs] mentioned above, the connector might want to store additional 
information from previous runs to make better (more performant) decisions. If 
we have state A and state B, the connector has to be able to get from A to B 
regardless of whether A was specified by the user or is the result of a 
previous incremental import. But getting from A to B where A is the result of 
a previous incremental import might be a bit faster if the connector preserved 
some additional information.

Having concluded that everything the user specifies for an incremental import 
(the incremental-related parameters) is an "initial state", it sounds 
confusing to me that when running {{show job}} I would see the initial state 
rather than the last one. For example, if I have an incremental import that 
requires the parameter "last imported value", I initially configure it to 
"10", run the job, and still see "10" when running {{show job}}, I would be 
confused by that.

I also thought about the case where the user decides to reconfigure the 
incremental job while the "last state" is stored apart from the configuration 
object. Since the user will change the configuration object, the connectors 
will then somehow be responsible for checking 
{{if (last modification of the configuration object 
> last run of the job) useConfigurationObject() else useLastRunOfTheJob()}}. As 
this condition would have to be in all connectors, I'm afraid that some of them 
might implement it a bit differently, which would cause incremental import to 
work differently in different connectors.
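
Spelled out as code, the check each connector would have to carry could look 
roughly like the sketch below. The class and method names are hypothetical, 
and the handling of a job that has never run is my own assumption rather than 
something settled in this discussion.

```java
import java.time.Instant;

// Hypothetical sketch; these names are illustrative and not Sqoop APIs.
class StartStateSketch {

    // Picks the value an incremental run starts from when the last state is
    // stored separately from the configuration object.
    static String startValue(Instant configLastModified, Instant lastRun,
                             String configuredValue, String lastRunValue) {
        // Assumption: with no previous run, only the user-supplied value exists.
        if (lastRun == null) {
            return configuredValue;
        }
        // The user edited the configuration after the last run, so the edit wins.
        if (configLastModified.isAfter(lastRun)) {
            return configuredValue;
        }
        // Otherwise continue from where the previous run ended.
        return lastRunValue;
    }
}
```

Even in this small sketch there are choices (strict versus non-strict 
comparison when the two timestamps are equal, what to do before the first 
run) that different connectors could easily make differently.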

I understand the concern about allowing connectors to change entire 
configuration objects, as that is something the user might not expect. I would 
also be concerned about a case where a connector changes an unrelated 
parameter (let's say the JDBC URL) as part of an incremental import. I'm 
wondering whether it would help to define an annotation (or extend an existing 
one) marking certain parameters as part of the state, clearly mention in the 
UI that those will change as the job runs, and ensure that only parameters 
tagged with this annotation are changed (and no others)? With SQOOP-1983, we 
would even have a history of how the value changed over time.
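
As a rough illustration of the annotation idea, the sketch below tags state 
parameters and reports any other field that a connector changed during a run. 
The {{@State}} annotation, the {{ExampleFromJobConfig}} class and the 
{{illegalChanges}} helper are all hypothetical names chosen for this example; 
nothing here reflects an existing Sqoop annotation or API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch; the annotation and classes below do not exist in Sqoop.
class StateAnnotationSketch {

    // Marks a configuration field as part of the incremental state.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface State {}

    // Example connector-owned configuration class.
    static class ExampleFromJobConfig {
        String jdbcUrl;                   // must not change during a run
        @State String lastImportedValue;  // expected to change as the job runs

        ExampleFromJobConfig(String jdbcUrl, String lastImportedValue) {
            this.jdbcUrl = jdbcUrl;
            this.lastImportedValue = lastImportedValue;
        }
    }

    // Returns the names of fields whose values differ between the two objects
    // although they are not tagged with @State. Both objects are assumed to be
    // instances of the same configuration class.
    static List<String> illegalChanges(Object before, Object after) {
        List<String> violations = new ArrayList<>();
        for (Field field : before.getClass().getDeclaredFields()) {
            if (field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true);
            try {
                boolean changed = !Objects.equals(field.get(before), field.get(after));
                if (changed && !field.isAnnotationPresent(State.class)) {
                    violations.add(field.getName());
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read field " + field.getName(), e);
            }
        }
        return violations;
    }
}
```

Sqoop could run such a check after the connector returns and reject (or roll 
back) the update when the returned list is not empty.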

> Sqoop2: Delta Fetch/ Merge ( formerly called Incremental Import )
> -----------------------------------------------------------------
>
>                 Key: SQOOP-1168
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1168
>             Project: Sqoop
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.5
>
>
> The formal design wiki is here 
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+and+Merge+Design



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
