[ 
https://issues.apache.org/jira/browse/SQOOP-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Veena Basavaraj updated SQOOP-2025:
-----------------------------------
    Description: 
As per SQOOP-1804, we will be storing both the config inputs and the
intermediate state generated as part of the job run in the config object.

Currently the config object is stored in the repository model under the
{code}SQ_CONFIG{code} table. It is stored per SQ_CONFIGURABLE.

The inputs within the Config class and their attributes are stored in the
{code}SQ_INPUT{code} table, i.e. the columns in SQ_INPUT map to the attributes
of the config @Input annotation:
{code}
  @Input(size = 50)
  public String schemaName;

  @Input(size = 50)
  public String tableName;
{code}

The actual values for the SQ_INPUT keys per Sqoop job are stored in the
{code}SQ_JOB_INPUT{code} and {code}SQ_LINK_INPUT{code} tables.

So this means we overwrite the config input values for every job run.
Let's take an example.

Say a job is started with the value "foo" for the config key "test". During the
first job run, the SQ_JOB_INPUT table will reflect the value "foo". If the
value is modified to "bar" before the second run, the table will then reflect
the value "bar". If the user queries the config values by job id, they will
only see the last modified value, i.e. "bar". It does not tell the user which
value was in effect when a given job run started, or which value was in effect
when that job run / submission ended.
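
To illustrate the overwrite problem in the example above, here is a minimal
in-memory sketch. It is not Sqoop code: the class LatestValueStore and its
methods are hypothetical and only model a table that keeps a single value per
(job id, input key) pair.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a value table keyed by (jobId, inputKey).
// It is not part of the Sqoop repository API.
class LatestValueStore {
    private final Map<String, String> values = new HashMap<>();

    // Writing a value replaces whatever was stored for the same job and key.
    void put(long jobId, String key, String value) {
        values.put(jobId + "/" + key, value);
    }

    // Returns the only value kept for this job and key, or null if none.
    String get(long jobId, String key) {
        return values.get(jobId + "/" + key);
    }
}

class OverwriteDemo {
    public static void main(String[] args) {
        LatestValueStore store = new LatestValueStore();
        store.put(1L, "test", "foo"); // value used by the first run
        store.put(1L, "test", "bar"); // edited before the second run
        // Only the last value is visible; "foo" can no longer be recovered.
        System.out.println(store.get(1L, "test")); // prints "bar"
    }
}
```

Because the store has no notion of a job run, nothing in it can answer "which
value did the first run use?".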

The proposal is to provide this history so that the user can track per job run 
the config input values.

A simple proposal is to add a submission_id foreign key to the SQ_JOB_INPUT
and SQ_LINK_INPUT tables.

[~anandriyer] also suggested we store the before/after config state if possible.

To implement the BEFORE/AFTER config history:

1. We will create a new set of values for each config input for every job run,
based on the previous state. If the user edits the configs while the previous
job is running, we will create new values with a null submissionId and
associate them with the submission id once the job run starts. Once the job run
finishes, we will write the config values again to store the AFTER information.

2. We will need to store the BEFORE/AFTER indicator in another column. 

3. We will make only the last run config input values editable if the job has 
not yet started.
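
The three steps above can be sketched in memory as follows. This is only an
illustration under simplifying assumptions, not the repository implementation:
all names (Phase, InputValueRow, JobInputHistory) are hypothetical, one
JobInputHistory instance stands for a single job, and the "pending" map plays
the role of the rows with a null submissionId.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical BEFORE/AFTER indicator (step 2).
enum Phase { BEFORE, AFTER }

// Hypothetical row: one input value tied to one submission and one phase.
final class InputValueRow {
    final long submissionId;
    final String inputKey;
    final String value;
    final Phase phase;

    InputValueRow(long submissionId, String inputKey, String value, Phase phase) {
        this.submissionId = submissionId;
        this.inputKey = inputKey;
        this.value = value;
        this.phase = phase;
    }
}

class JobInputHistory {
    // Editable values not yet tied to a run (null submissionId in the proposal).
    private final Map<String, String> pending = new LinkedHashMap<>();
    private final List<InputValueRow> rows = new ArrayList<>();

    // Step 3: edits only ever touch the pending, not-yet-started values.
    void edit(String key, String value) { pending.put(key, value); }

    // Step 1: starting a run snapshots the pending values as BEFORE rows.
    // Pending values carry over, so the next run starts from the same state
    // unless the user edits them (an assumption of this sketch).
    void start(long submissionId) { record(submissionId, pending, Phase.BEFORE); }

    // Step 1: finishing a run writes the values again as AFTER rows.
    void finish(long submissionId, Map<String, String> finalValues) {
        record(submissionId, finalValues, Phase.AFTER);
    }

    // History query: the values one run saw in the given phase.
    Map<String, String> valuesFor(long submissionId, Phase phase) {
        Map<String, String> result = new LinkedHashMap<>();
        for (InputValueRow r : rows) {
            if (r.submissionId == submissionId && r.phase == phase) {
                result.put(r.inputKey, r.value);
            }
        }
        return result;
    }

    private void record(long submissionId, Map<String, String> values, Phase phase) {
        for (Map.Entry<String, String> e : values.entrySet()) {
            rows.add(new InputValueRow(submissionId, e.getKey(), e.getValue(), phase));
        }
    }
}

class HistoryDemo {
    public static void main(String[] args) {
        JobInputHistory history = new JobInputHistory();
        history.edit("test", "foo");
        history.start(1L);
        history.edit("test", "bar"); // edited while run 1 is still running
        history.finish(1L, Collections.singletonMap("test", "foo"));
        history.start(2L);
        System.out.println(history.valuesFor(1L, Phase.BEFORE)); // {test=foo}
        System.out.println(history.valuesFor(2L, Phase.BEFORE)); // {test=bar}
    }
}
```

Since an edit made during run 1 never touches the rows already recorded for
run 1, each run keeps its own state, which is the property the proposal relies
on to avoid races on config input edits.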

 

Pros:
We have a history per job run that we can query.
We do not have race conditions on config input value edits, since every job run
has its own state.

Cons:
We will have many more entries in the SQ_JOB_INPUT and SQ_LINK_INPUT tables
than we have now, but I see this as unavoidable if we need to provide easy
debuggability to the users on which inputs and values were used in every job
run, which values were edited, etc.


> Input/State history per job run / submission
> --------------------------------------------
>
>                 Key: SQOOP-2025
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2025
>             Project: Sqoop
>          Issue Type: Sub-task
>            Reporter: Veena Basavaraj
>            Assignee: Veena Basavaraj
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
