[
https://issues.apache.org/jira/browse/SQOOP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272969#comment-14272969
]
Jarek Jarcec Cecho commented on SQOOP-1168:
-------------------------------------------
Thank you for the nice summary [~vybs]!
I do have additional notes for two of the open questions that will hopefully
help show how my thoughts are flowing:
For *Question 2*:
Thank you for capturing the two ways in which a user could explicitly select
whether to incrementally transfer "only new records" or "new and updated
records". Let me repeat the two options here to make it easier for the reader
to know what I'm referring to. The first option is to explicitly expose some
sort of "job type" in the REST interface and (T|G)UI, e.g. {{create job
--from 1 --to 2 --type INCREMENTAL_NEW_ROWS_ONLY}} (wording is subject to
change). The second is to require the connector to expose certain configuration
options (switches) like "Enum Incremental incremental" with values {{NO}},
{{NEW_ROWS_ONLY}}, {{UPDATES_ALLOWED}} (keywords are just examples).
The configuration classes have been designed and are currently written in a way
that makes them opaque to Sqoop. E.g. Sqoop currently can't force the connector
to expose certain configuration options, as the definition of the classes is
fully in the connector space. We did this intentionally to separate concerns
between what values are required by Sqoop and what are required by the
connector. I'm a bit concerned about changing this idiom in the proposed second
option by prescribing some of the required values. I can immediately think of
two places where this would get tricky:
1) _Upgrades._ Right now the configuration object is fully owned and defined by
the connector and so is the {{Upgrader}} that is responsible for upgrading
objects created with an older version to the new version. If we add a
requirement for pre-defined fields then the upgrade path will become much more
complicated. The connector will then be responsible for moving values that it
does not own, and in corner cases it might even be responsible for upgrading
values that it's not aware of. For example, if we prescribe two fields in
Sqoop version A and write a connector for version A, I would like this
connector to work with all Sqoop versions A+ (B, C, D, ...). But it could
happen that in Sqoop version B we add a third prescribed value, and in this
case the connector would be responsible for upgrading a value that it does not
understand at all because it didn't exist at the time the connector was
written.
For values that are specific to Sqoop and not to the connector, we've
introduced the "framework" (now called "driver") portion of a job. Perhaps we
can re-use this facility to configure the incremental job if we don't want to
go with option 1?
2) _Relevant values._ This is to some extent restating a portion of my
previous example as a standalone point. Right now the connectors expose
what they support and are expected to cover those areas. It should be
very simple for us to add new functionality into Sqoop without affecting
existing connectors that don't support that new functionality, as those simply
did not explicitly sign up to support it. By putting prescribed values
into configuration objects we might force connectors to deal with situations
that they might not expect. One concrete example that came to my mind is that
we would prescribe {{Enum Incremental}} as mentioned above with values {{NO}},
{{NEW_ROWS_ONLY}}, {{UPDATES_ALLOWED}} in Sqoop version A and write a connector
for this version. In Sqoop version B, we would add a new value
{{DELETES_ALLOWED}} to the {{Enum}}. Now the configuration object contains an
option that the connector doesn't support (or even know about), which might
lead to a lot of trouble (starting with values being ignored and ending with
various unexpected exceptions). Of course this example could be solved, but I'm
trying to point out that we might open ourselves up to quite a lot of trouble
in the future.
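To illustrate the concern, here is a minimal Java sketch. All names
({{PrescribedEnumSketch}}, {{VersionAConnector}}, {{plan}}) are hypothetical
and are not part of Sqoop's API; the enum values are the example keywords from
above. It assumes a connector that knows only the version A values and shows
the least harmful outcome, an explicit failure, when it receives a value added
later.

```java
import java.util.EnumSet;
import java.util.Set;

public class PrescribedEnumSketch {

    // Hypothetical prescribed enum as it might look in Sqoop version B.
    // DELETES_ALLOWED did not exist when the connector below was written.
    enum Incremental { NO, NEW_ROWS_ONLY, UPDATES_ALLOWED, DELETES_ALLOWED }

    // Hypothetical connector written against version A of the enum.
    static final class VersionAConnector {
        // The only values this connector knew about when it was written.
        private static final Set<Incremental> KNOWN = EnumSet.of(
            Incremental.NO, Incremental.NEW_ROWS_ONLY, Incremental.UPDATES_ALLOWED);

        String plan(Incremental mode) {
            // Fail loudly instead of silently ignoring a value added later.
            if (!KNOWN.contains(mode)) {
                throw new IllegalArgumentException("Unsupported incremental mode: " + mode);
            }
            switch (mode) {
                case NO:              return "full transfer";
                case NEW_ROWS_ONLY:   return "append rows above the last imported value";
                case UPDATES_ALLOWED: return "append new rows and merge updated rows";
                default:              throw new IllegalStateException("unreachable: " + mode);
            }
        }
    }

    public static void main(String[] args) {
        VersionAConnector connector = new VersionAConnector();
        System.out.println(connector.plan(Incremental.NEW_ROWS_ONLY));
        try {
            connector.plan(Incremental.DELETES_ALLOWED);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A connector without such a guard would more likely ignore the new value or
fail somewhere less obvious, which is the range of outcomes described above.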
For *Question 3*:
I wanted to explain a bit where my thoughts started with the suggestion to
store the last-run information in the original configuration object. When
creating the job, the user will have to enter all information about the state
from which they want to start the incremental import. When the incremental job
runs, it will end in a different state. A different user might want to start in
this second (or any subsequent) state. Hence I came to the conclusion that all
the state information that the connector needs must be present in the
configuration object. Otherwise there would be states which the user can never
specify (start from).
As [~vybs] mentioned above, the connector might want to store additional
information from previous runs to make better (more performant) decisions. If
we have state A and state B, the connector has to be able to get from A to B
regardless of whether A was specified by the user or is the result of a
previous incremental import. But getting from A to B where A is the result of a
previous incremental import might be a bit faster if the connector preserved
some additional information.
Having come to the conclusion that everything the user specifies for an
incremental import (the incremental-related parameters) is an "initial state",
it sounds confusing to me that when running {{show job}} I would see the
initial state rather than the last one. To give an example, if I have an
incremental import that requires the parameter "last imported value" and I
initially configured it to "10", ran the job and still saw "10" when running
{{show job}}, I would be confused by that.
I also thought about the case where the user decides to reconfigure the
incremental job while the "last state" is stored separately from the
configuration object. As the user will change the configuration object, the
connectors will then be somehow responsible for making a check
{{if (last modification of the configuration object
> last run of the job) useConfigurationObject() else useLastRunOfTheJob()}}. As
this condition would have to be in all connectors, I'm afraid that some of them
might implement it a bit differently, which would cause incremental import to
work differently in different connectors.
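As a rough Java sketch of that check (all names here, such as {{JobInfo}} and
{{effectiveLastValue}}, are hypothetical and not part of Sqoop), the logic each
connector would need looks something like this. Note the two choices that are
not obvious from the condition alone: what to do when the job has never run,
and what to do when both timestamps are equal. Those are exactly the places
where connectors could diverge.

```java
import java.time.Instant;

public class StateSelectionSketch {

    // Hypothetical view of a job where the "last state" is stored
    // separately from the user-edited configuration object.
    static final class JobInfo {
        final Instant configLastModified; // when the user last edited the configuration
        final Instant lastRun;            // null if the job has never run
        final String configuredLastValue; // value from the configuration object
        final String storedLastValue;     // value recorded by the last run

        JobInfo(Instant configLastModified, Instant lastRun,
                String configuredLastValue, String storedLastValue) {
            this.configLastModified = configLastModified;
            this.lastRun = lastRun;
            this.configuredLastValue = configuredLastValue;
            this.storedLastValue = storedLastValue;
        }
    }

    // The condition from the text. Assumptions made by this sketch:
    // a never-run job uses the configuration, and equal timestamps
    // favor the state recorded by the last run.
    static String effectiveLastValue(JobInfo job) {
        if (job.lastRun == null || job.configLastModified.isAfter(job.lastRun)) {
            return job.configuredLastValue;
        }
        return job.storedLastValue;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2015-01-01T00:00:00Z");
        Instant t1 = Instant.parse("2015-01-02T00:00:00Z");
        // Configured at t0, ran at t1: the stored state wins.
        System.out.println(effectiveLastValue(new JobInfo(t0, t1, "10", "250")));
        // Ran at t0, reconfigured at t1: the configuration wins.
        System.out.println(effectiveLastValue(new JobInfo(t1, t0, "100", "250")));
    }
}
```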
I understand the concern about allowing connectors to change entire
configuration objects, as that is something that the user might not expect. I
would also be concerned about the case where a connector changes an unrelated
parameter (let's say the JDBC URL) as part of an incremental import. I'm
wondering whether the following would help address the concern: define an
annotation (or extend an existing one) that marks certain parameters as part
of the state, clearly mention in the UI that those will change as the job runs,
and ensure that only parameters tagged with this annotation are changed (and no
others). With SQOOP-1983, we would even have a history of how the value changed
over time.
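A minimal Java sketch of how such an annotation and check could look. The
annotation name {{@JobState}}, the {{FromJobConfig}} class and the
{{illegalChanges}} helper are all hypothetical and are not existing Sqoop
classes. The sketch assumes Sqoop keeps a copy of the configuration object from
before the run and compares it field by field with the one the connector
returns.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class StateAnnotationSketch {

    // Hypothetical marker: this field is part of the incremental state
    // and may be changed by the connector as the job runs.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface JobState {}

    // Hypothetical connector-owned configuration object.
    static final class FromJobConfig {
        String jdbcUrl;
        String tableName;
        @JobState String lastImportedValue;

        FromJobConfig(String jdbcUrl, String tableName, String lastImportedValue) {
            this.jdbcUrl = jdbcUrl;
            this.tableName = tableName;
            this.lastImportedValue = lastImportedValue;
        }
    }

    // Returns the names of fields that changed without carrying @JobState.
    static List<String> illegalChanges(Object before, Object after)
            throws IllegalAccessException {
        if (before.getClass() != after.getClass()) {
            throw new IllegalArgumentException("Configuration objects differ in type");
        }
        List<String> violations = new ArrayList<>();
        for (Field field : before.getClass().getDeclaredFields()) {
            if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) {
                continue; // only instance fields hold configuration values
            }
            field.setAccessible(true);
            boolean changed = !Objects.equals(field.get(before), field.get(after));
            if (changed && !field.isAnnotationPresent(JobState.class)) {
                violations.add(field.getName());
            }
        }
        return violations;
    }

    public static void main(String[] args) throws IllegalAccessException {
        FromJobConfig before = new FromJobConfig("jdbc:mysql://db/test", "orders", "10");
        FromJobConfig ok = new FromJobConfig("jdbc:mysql://db/test", "orders", "250");
        FromJobConfig bad = new FromJobConfig("jdbc:mysql://other/test", "orders", "250");
        System.out.println(illegalChanges(before, ok));
        System.out.println(illegalChanges(before, bad));
    }
}
```

In this sketch a run that only moves {{lastImportedValue}} passes, while a run
that also touches {{jdbcUrl}} is reported, so the check can live in one place
in Sqoop rather than in each connector.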
> Sqoop2: Delta Fetch/ Merge ( formerly called Incremental Import )
> -----------------------------------------------------------------
>
> Key: SQOOP-1168
> URL: https://issues.apache.org/jira/browse/SQOOP-1168
> Project: Sqoop
> Issue Type: Bug
> Reporter: Hari Shreedharan
> Assignee: Veena Basavaraj
> Fix For: 1.99.5
>
>
> The formal design wiki is here
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+and+Merge+Design