[
https://issues.apache.org/jira/browse/SQOOP-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290675#comment-14290675
]
Veena Basavaraj commented on SQOOP-2025:
----------------------------------------
An example of why we need BEFORE and AFTER state from [~anandriyer]
{quote}
Let's take a Kafka reader that performs batch ingest from Kafka and has an
offset (referred to as OF in the rest of this doc).
OF can be edited by both:
- the connector, since at the end of a run the connector will update the last
offset that was read
- the user, since the user may decide to re-read or skip ahead
At the end of run R3, the connector sets OF = 100.
However, user changes OF to 90.
Thus, at the start of run R4, OF = 90.
Thus, the value of OF is different between the end of R3 and the start of R4.
Hence our history will have to display what the value of OF was at the end of
R3, and also display what OF was at the start of R4.
The actual implementation may not need us to store before and after, but the
user will need to know what the value of OF was after a run and before the next
run.
{quote}
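The OF example above can be sketched as a small history structure that records an input's value at the start and end of each run. This is a minimal hypothetical sketch (the class and method names are invented for illustration, not Sqoop code); it only shows why both BEFORE and AFTER values must be kept to surface the R3/R4 discrepancy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: record the value of a config input (here the Kafka
// offset "OF") at the start ("BEFORE") and end ("AFTER") of each run, so
// that both values can be displayed to the user later.
class InputHistory {
    record Entry(String run, String phase, long value) {}

    private final List<Entry> entries = new ArrayList<>();

    void record(String run, String phase, long value) {
        entries.add(new Entry(run, phase, value));
    }

    // Most recent value recorded for the given run and phase.
    long valueAt(String run, String phase) {
        for (int i = entries.size() - 1; i >= 0; i--) {
            Entry e = entries.get(i);
            if (e.run().equals(run) && e.phase().equals(phase)) {
                return e.value();
            }
        }
        throw new IllegalArgumentException("no entry for " + run + "/" + phase);
    }
}
```

With `record("R3", "AFTER", 100)` (the connector's write) and `record("R4", "BEFORE", 90)` (after the user's edit), the history can show both values, which a single overwritten column cannot.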
Here are some details on the proposal discussion that happened offline between
Anand, Jarcec and Veena.
The scope of this proposal is to track the before and after config input
values for every job run/submission, so that users can debug and track what
happened across job runs, especially in the case of delta fetch and merge
(SQOOP-1168), which can modify the inputs in non-obvious ways depending on how
the connector operates.
As this proposal describes, we now have the ability to edit config inputs:
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+Config+as+Top+Level+Entity
If we read the details in the proposal, it is clear that as of today we do not
allow adding/deleting configs and inputs via the REST API or command line; it
is strictly restricted to the configurable code [1] (connectors and the
driver). One other feature that has absolutely no documentation of either its
design or its behavior is how the connector upgrade code works.
To briefly explain it: connectors have a revision associated with the jar
artifact that they publish. In a newer revision, connectors have been given the
power to change/delete/add any of the config inputs they expose in @Config
annotated classes, a feature that seems over-engineered to me at this point.
Even more interesting is how the upgrade code works every time a new revision
of a connector is added [2].
Let's take an example: if the current installation happens to have a connector
"C" with a Config class named "config" and inputs "I1" and "I2", we verify
that these inputs still exist in the new version of the connector being
installed/registered. If the "config" name changed to "config1", or "I1" and
"I2" changed to "I3" and "I4", we want to update to the new values.
A common way to do this in databases is to change the existing SQ_CONFIG name
in place; if a new config was added, add a new row to the table; and if an
input was deleted, mark all the corresponding inputs and values invisible
(with a flag) to indicate that they are no longer relevant. In this approach we
modify the values in the corresponding rows, but we never delete an entire row
and make it disappear; if we did that, we would have no record of what the
connector code did. The data can end up in a very corrupt state if we allow
values to be overwritten without any tracking of who changed what.
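The rename-in-place/soft-delete style described above can be sketched with a small in-memory registry. This is a hypothetical illustration (the class is invented, not Sqoop repository code): renames keep the same primary key, deletions only flip a visibility flag, and the row count never shrinks, so history survives the upgrade.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "never delete a row" upgrade style: renames
// update the row in place under the same id, and deletions only mark the
// row invisible, preserving a record of what the connector changed.
class SoftDeleteRegistry {
    static final class InputRow {
        String name;
        boolean visible = true;
        InputRow(String name) { this.name = name; }
    }

    private final Map<Long, InputRow> rowsById = new LinkedHashMap<>();
    private long nextId = 1;

    long add(String name) {
        long id = nextId++;
        rowsById.put(id, new InputRow(name));
        return id;
    }

    // Rename in place: the primary key (and all FK references) stay valid.
    void rename(long id, String newName) {
        rowsById.get(id).name = newName;
    }

    // Soft delete: flag the row invisible instead of removing it.
    void softDelete(long id) {
        rowsById.get(id).visible = false;
    }

    int totalRows()            { return rowsById.size(); }
    boolean isVisible(long id) { return rowsById.get(id).visible; }
    String nameOf(long id)     { return rowsById.get(id).name; }
}
```

Because the id is stable, SQ_JOB_INPUT/SQ_LINK_INPUT rows pointing at an input would not need to be touched at all during an upgrade.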
Instead, in the current code, the upgrade is done as follows: all SQ_JOB_INPUT
rows corresponding to that connector are deleted, all SQ_LINK_INPUT rows
corresponding to that connector are deleted, all SQ_INPUT entries are deleted,
and all SQ_CONFIG entries for that connector are deleted, then recreated again
with new primary key ids. No tracking exists of who changed what or what the
previous value was; it just disappears.
Having to live with this implementation, it is clear that just adding a
submissionId to the SQ_JOB_INPUT table and keeping the history of the values in
this table is not enough. We have to modify the upgrade code to do a few
things:
1. The upgrade code should never drop records in the first place. There are
easier ways to achieve changes to the names of inputs and attributes in
configs. The connectors can be smart about telling us which inputs changed in
each revision by simply providing the "key" that uniquely identifies the row
(it can be the name of the input, since that is unique) together with the new
values for that "key", which can be a map. For instance, if there is an
SQ_INPUT with name "foo", type "long", sensitive "true", and we want to change
the name to "bar" and the type to "String", we can send a value in the upgrade
code:
{code}
"foo" : { "name" : "bar", "type" : "String" }
{code}
Instead, we erase the whole entry, and in order to erase the whole entry in
SQ_INPUT, every single value in SQ_JOB_INPUT and SQ_LINK_INPUT is dropped. This
is not just bad from a performance perspective; it is simply not how edits are
done. Imagine doing the same from the REST API or command line: would we drop
all the FK entries just to update a row? It would be absurd if edits to a table
entry were done this way for the REST API, so why is connector code special?
What is the justification for this? I am unclear at this point. And if the
config input value history grows, dropping every value and recreating it is
costly.
More discussions details will be added soon to this ticket.
[1] :
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2+(1.99.4)+Entity+Nomenclature+and+Relationships
[2]:
https://github.com/apache/sqoop/blob/sqoop2/core/src/main/java/org/apache/sqoop/repository/Repository.java#L433
> Input/State history per job run / submission
> --------------------------------------------
>
> Key: SQOOP-2025
> URL: https://issues.apache.org/jira/browse/SQOOP-2025
> Project: Sqoop
> Issue Type: Sub-task
> Reporter: Veena Basavaraj
> Assignee: Veena Basavaraj
> Fix For: 2.0.0
>
>
> As per SQOOP-1804, we will be treating both the config inputs and the
> intermediate state generated as part of the job run as part of the config
> object.
> Currently the config object is stored in the repository model under the
> {code}SQ_CONFIG{code} table. It is per SQ_CONFIGURABLE.
> The inputs within the Config class and their attributes are stored in
> {code}SQ_INPUT{code},
> i.e., the columns in SQ_INPUT map to the attributes of the config @Input
> annotation:
> {code}
> @Input(size = 50)
> public String schemaName;
> @Input(size = 50)
> public String tableName;
> {code}
> The actual values for the SQ_INPUT keys per Sqoop job are stored in
> {code}
> SQ_JOB_INPUT and SQ_LINK_INPUT
> {code}
> This means we overwrite the config input values on every job run.
> Let's take an example: if a job is started with the config value for key
> "test" as "foo", after the first job run SQ_INPUT will reflect the value
> "foo". If, before the second run, the value is modified to "bar", then the
> SQ_INPUT table will reflect the value "bar". If the user were to query the
> config values based on the job id, they would only see the last value
> modified, i.e. "bar"; it does not tell the user what value was used when the
> job run started and what value it had when the job run/submission ended.
> The proposal is to provide this history so that the user can track per job
> run the config input values.
> A simple proposal is to have a FK submission_id in the SQ_JOB_INPUT table,
> and SQ_LINK_INPUT table.
> [~anandriyer] also suggested we store the before/after config state if
> possible.
> To do the BEFORE/AFTER config history:
> 1. We will create a new set of values for each config input for every job
> run, based on the previous state; or, if the user edits the configs while the
> previous job is running, create new ones with a null submissionId and
> associate them with the submission id once the job run starts. Once the job
> run finishes, we will write the config values again to store the AFTER
> information.
> 2. We will need to store the BEFORE/AFTER indicator in another column.
> 3. We will make the config input values of only the last run editable, and
> only if the job has not yet started.
>
> Pros:
> We have a history per job run that we can query.
> We do not have race conditions on config input value edits, since every job
> run has its own state.
> Cons:
> We will have many more entries in SQ_JOB_INPUT and SQ_LINK_INPUT than we have
> now, but I see this as justified if we need to provide easy debuggability to
> users on what inputs and values were used in every job run, what values were
> edited, etc.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)