[ https://issues.apache.org/jira/browse/SQOOP-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290675#comment-14290675 ]

Veena Basavaraj commented on SQOOP-2025:
----------------------------------------

An example of why we need BEFORE and AFTER state from [~anandriyer]
{quote}


Let's take a Kafka reader that performs batch ingest from Kafka and maintains an 
offset (referred to as OF in the rest of this doc).
OF can be edited by both:
- the connector, since at the end of a run the connector will update the last 
offset that was read
- the user, since the user may decide to re-read or skip ahead

At the end of run R3, the connector sets OF = 100.

However, the user changes OF to 90.

Thus, at the start of run R4, OF = 90.

Thus, the value of OF is different between the end of R3 and the start of R4. 
Hence our history will have to display what the value of OF was at the end of 
R3, and also display what OF was at the start of R4.

The actual implementation may not need us to store before and after, but the 
user will need to know what the value of OF was after a run and before the next 
run.

{quote}
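The BEFORE/AFTER idea in the quote above can be sketched as follows. This is an illustrative in-memory model, not Sqoop code; the names `OffsetHistory`, `Snapshot`, and `Phase` are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep a snapshot of an input's value at the start
// (BEFORE) and end (AFTER) of each submission, instead of overwriting one row.
public class OffsetHistory {
    enum Phase { BEFORE, AFTER }

    record Snapshot(long submissionId, Phase phase, String name, String value) {}

    private final List<Snapshot> history = new ArrayList<>();

    void record(long submissionId, Phase phase, String name, String value) {
        history.add(new Snapshot(submissionId, phase, name, value));
    }

    List<Snapshot> forSubmission(long submissionId) {
        return history.stream().filter(s -> s.submissionId() == submissionId).toList();
    }

    public static void main(String[] args) {
        OffsetHistory h = new OffsetHistory();
        // End of run R3: the connector sets OF = 100.
        h.record(3, Phase.AFTER, "OF", "100");
        // User edits OF to 90 before R4; the start of R4 snapshots the edited value.
        h.record(4, Phase.BEFORE, "OF", "90");
        // Both values survive, so the history can show why R3's AFTER (100)
        // differs from R4's BEFORE (90).
        System.out.println(h.forSubmission(3));
        System.out.println(h.forSubmission(4));
    }
}
```

With a model like this, querying by submission id answers both "what did the connector write?" and "what did the user change?" for any run.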


Here are some details on the proposal discussion that happened offline between 
Anand, Jarcec, and Veena.

The scope of this proposal was to track the before and after config input 
values for every job run/submission, so that it helps users debug and track 
what happened across job runs, especially in the case of delta fetch and 
merge (SQOOP-1168), which can modify the inputs in not-so-obvious ways 
depending on how the connector operates.

As this proposal describes, we now have the ability to edit config inputs:
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+Config+as+Top+Level+Entity

If we read the details in the proposal, it is clear that as of today we do not 
allow adding/deleting configs and inputs via the REST API or command line.

That ability is strictly restricted to the configurable code [1] (connectors 
and the driver). Another feature that has absolutely no documentation of 
either its design or its behavior is the connector upgrade code.

To briefly explain it: connectors have a revision associated with the jar 
artifact that they publish. In a newer revision, connectors have been given the 
power to change/delete/add any of the config inputs they expose in @Config 
annotated classes, a feature that seems over-engineered to me at this point. 
Even more interesting is how the upgrade code runs every time a new revision 
of a connector is added [2].

Let's take an example.

If the current installation happens to have a connector "C" with a Config class 
named "config" and inputs "I1" and "I2", we verify that these inputs are still 
present in the new version of the connector being installed/registered. If the 
"config" name changed to "config1", or "I1" and "I2" changed to "I3" and 
"I4", we want to update to the new values.

Now, a common way this is done in databases: change the existing SQ_CONFIG name 
in place; if a new config was added, add a new row to the table; if an input 
was deleted, mark all the corresponding inputs and values invisible (with a 
flag) so it is clear they are no longer relevant. In this approach, we modify 
the values in the corresponding rows. We never delete the entire row and make 
it disappear; if we did that, we would have no record of what shenanigans the 
connector code did. The data could end up in a very corrupt state if we 
allowed overwriting values without any tracking of who changed what.
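A minimal sketch of that soft-delete style, using an illustrative in-memory stand-in for the SQ_INPUT table (the `InputRow` type and its `visible` flag are hypothetical, not actual Sqoop schema):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "never delete rows" upgrade style described above.
public class SoftDeleteUpgrade {
    static class InputRow {
        final long id;          // primary key stays stable across upgrades
        String name;
        boolean visible = true; // flagged invisible instead of being deleted

        InputRow(long id, String name) { this.id = id; this.name = name; }
    }

    // name -> row, as a stand-in for the SQ_INPUT rows of one config
    final Map<String, InputRow> table = new LinkedHashMap<>();

    void rename(String oldName, String newName) {
        InputRow row = table.remove(oldName);
        row.name = newName;      // update in place: the id (and any FK values) survive
        table.put(newName, row);
    }

    void delete(String name) {
        table.get(name).visible = false; // soft delete: history stays queryable
    }

    public static void main(String[] args) {
        SoftDeleteUpgrade t = new SoftDeleteUpgrade();
        t.table.put("I1", new InputRow(1, "I1"));
        t.table.put("I2", new InputRow(2, "I2"));
        t.rename("I1", "I3"); // new connector revision renamed I1 -> I3
        t.delete("I2");       // ...and dropped I2
        System.out.println(t.table.get("I3").id);      // still 1: FKs remain valid
        System.out.println(t.table.get("I2").visible); // false, but the row exists
    }
}
```

Because the primary key never changes, rows in SQ_JOB_INPUT and SQ_LINK_INPUT that reference it would remain valid across an upgrade.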

But instead, this is how the upgrade is done in the current code:

All the SQ_JOB_INPUT rows corresponding to that connector are deleted, all the 
SQ_LINK_INPUT rows corresponding to that connector are deleted, all SQ_INPUT 
entries are deleted, all SQ_CONFIG entries for that connector are deleted, and 
then everything is recreated with new primary key ids. No tracking exists of 
who changed it or what the previous value was. It just disappears.

Having to live with this implementation, it is clear that just adding 
submissionId to the SQ_JOB_INPUT table and keeping the history of the values in 
this table is not enough. We have to modify the upgrade code to do a few things:

1. The upgrade code should never drop records in the first place. There are 
easier ways to handle changes to the names of inputs and attributes in configs: 
the connector can indicate which inputs changed in each revision by simply 
providing the "key" that uniquely identifies the row (it can be the name of the 
input, since it is unique) together with the new values for that "key", which 
can be a map. For instance:

If there is an SQ_INPUT with name "foo", type "long", and sensitive "true", and 
say we want to change the name to "bar" and the type to "String", we can send a 
value in the upgrade code:

{code}"foo" : { "name" : "bar", "type" : "String" }{code}

Instead, we erase the whole entry; and in order to erase the whole entry in 
SQ_INPUT, every single value in SQ_JOB_INPUT and SQ_LINK_INPUT is dropped. This 
is not just bad from a performance perspective; it is not how edits are done. 
Imagine doing the same from the REST API or command line: would we drop all 
FK entries to actually update a row? It would be absurd if edits to a table 
entry were done this way for the REST API, so why is the connector code 
special? What is the justification for this? I am unclear at this point.

If the config input value history grows, dropping every value and recreating 
it is costly.


More discussions details will be added soon to this ticket.


[1] : 
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2+(1.99.4)+Entity+Nomenclature+and+Relationships

[2]: 
https://github.com/apache/sqoop/blob/sqoop2/core/src/main/java/org/apache/sqoop/repository/Repository.java#L433

> Input/State history per job run / submission
> --------------------------------------------
>
>                 Key: SQOOP-2025
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2025
>             Project: Sqoop
>          Issue Type: Sub-task
>            Reporter: Veena Basavaraj
>            Assignee: Veena Basavaraj
>             Fix For: 2.0.0
>
>
> As per SQOOP-1804, we will be storing both the config inputs and the 
> intermediate state generated as part of the job run in the config object. 
> Currently the config object is stored in the repository model under the 
> {code}SQ_CONFIG{code} table. It is per SQ_CONFIGURABLE. 
> The inputs within the Config class and their attributes are stored in 
> {code}SQ_INPUT{code},
> i.e. the columns in SQ_INPUT map to the attributes of the @Input 
> annotation:
> {code}
>  @Input(size = 50)
>   public String schemaName;
>   @Input(size = 50)
>   public String tableName;
> {code}
> The actual values for the SQ_INPUT keys per sqoop job are stored in
> {code}
> SQ_JOB_INPUT and SQ_LINK_INPUT 
> {code}
> So this means we overwrite the config input values for every job run.
> Let's take an example.
> If a job is started with the config value for key "test" as "foo", after the 
> first job run SQ_INPUT will reflect the value "foo". Before the second run, 
> say the value was modified to "bar"; then the SQ_INPUT table will reflect the 
> value "bar". If the user were to query the config values based on the job 
> id, they would only see the last value modified, i.e. "bar"; it does not tell 
> the user the value that was used when the job run started nor the value when 
> the job run/submission ended.
> The proposal is to provide this history so that the user can track per job 
> run the config input values.
> A simple proposal is to have a FK submission_id in the SQ_JOB_INPUT table,
> and SQ_LINK_INPUT table.
> [~anandriyer] also suggested we store before/after config state if possible.
> To do the BEFORE/AFTER config history, 
> 1. We will create a new set of values for each config input for every job 
> run, based on the previous state, or, if the user edits the configs while the 
> previous job is running, create new ones with a null submissionId and 
> associate them with the submission id once the job run starts. Once the job 
> run finishes, we will write the config values again to store the AFTER 
> information.
> 2. We will need to store the BEFORE/AFTER indicator in another column. 
> 3. We will make only the last run config input values editable if the job has 
> not yet started.
>  
> Pros:
> We have a history per job run that we can query.
> We do not have race conditions on config input value edits, since every job 
> run has its own state.
> Cons:
> We will have many more entries in SQ_JOB_INPUT and SQ_LINK_INPUT than we 
> have now, but I see this as necessary if we are to provide easy 
> debuggability to the users on what inputs and values were used in every job 
> run, what values were edited, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
