[
https://issues.apache.org/jira/browse/SQOOP-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Veena Basavaraj updated SQOOP-2025:
-----------------------------------
Description:
As per SQOOP-1804, we will be storing both the config inputs and the
intermediate state generated as part of the job run in the config object.
Currently the config object is stored in the repository model under the
{code}SQ_CONFIG{code} table. It is per SQ_CONFIGURABLE.
The inputs within the Config class and their attributes are stored in
{code}SQ_INPUT{code}
i.e. the columns in SQ_INPUT map to the attributes of the config @Input
annotation:
{code}
@Input(size = 50)
public String schemaName;
@Input(size = 50)
public String tableName;
{code}
The actual values for the SQ_INPUT keys per sqoop job are stored in
{code}
SQ_JOB_INPUT and SQ_LINK_INPUT
{code}
So this means we overwrite the config input values for every job run.
Let's take an example.
If a job is started with the config value for key "test" set to "foo", then
during the first job run the value stored for that SQ_INPUT key is "foo". If
the value is modified to "bar" before the second run, the stored value becomes
"bar". If the user then queries the config values by job id, they will only
see the last modified value, i.e. "bar". This does not tell the user what the
value was before the job run started, or what it was when the job run /
submission ended.
The proposal is to provide this history so that the user can track per job run
the config input values.
A simple proposal is to add a submission_id foreign key (FK) to the
SQ_JOB_INPUT and SQ_LINK_INPUT tables.
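To make the simple proposal concrete, here is a minimal in-memory sketch in
Java. It is not Sqoop repository code: the class JobInputHistory and its
methods are hypothetical names, and a map stands in for the SQ_JOB_INPUT table
with the proposed submission_id column. It only shows that keying the values by
submission keeps "foo" and "bar" side by side instead of overwriting one row.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for SQ_JOB_INPUT with a submission_id column. */
public class JobInputHistory {
    // Key: "jobId/submissionId/inputName" -> value recorded for that run.
    private final Map<String, String> rows = new HashMap<>();

    private static String key(long jobId, long submissionId, String inputName) {
        return jobId + "/" + submissionId + "/" + inputName;
    }

    /** Records the value used by one submission instead of overwriting a single row. */
    public void record(long jobId, long submissionId, String inputName, String value) {
        rows.put(key(jobId, submissionId, inputName), value);
    }

    /** Returns the value a given submission ran with, or null if none was recorded. */
    public String valueFor(long jobId, long submissionId, String inputName) {
        return rows.get(key(jobId, submissionId, inputName));
    }

    public static void main(String[] args) {
        JobInputHistory history = new JobInputHistory();
        history.record(1L, 100L, "test", "foo"); // first run
        history.record(1L, 101L, "test", "bar"); // second run, after the edit
        System.out.println(history.valueFor(1L, 100L, "test")); // prints "foo"
        System.out.println(history.valueFor(1L, 101L, "test")); // prints "bar"
    }
}
```

The job id and submission ids (1, 100, 101) are made-up values for the example
above; in the repository they would come from the job and submission tables.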
[~anandriyer] also suggested we store the before/after config state if possible.
To support the BEFORE/AFTER config history:
1. We will create a new set of values for the config inputs for every job run,
based on the previous state. If the user edits the configs while the previous
job is running, we create new values with a null submissionId and associate
them with the submission id once the job run starts. Once the job run finishes,
we will write the config values again to store the AFTER information.
2. We will need to store the BEFORE/AFTER indicator in another column.
3. We will make only the last run config input values editable if the job has
not yet started.
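The three steps above could look roughly like the following Java sketch. All
names here (ConfigSnapshots, Phase, Row, editPending, start, finish, lookup)
are hypothetical and only illustrate the idea; a list stands in for the
SQ_JOB_INPUT rows, and copying the previous run's values forward is left out
to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of BEFORE/AFTER config input snapshots per submission. */
public class ConfigSnapshots {
    /** The BEFORE/AFTER indicator that would live in its own column. */
    public enum Phase { BEFORE, AFTER }

    /** One stored value; submissionId stays null until a job run claims it. */
    private static final class Row {
        Long submissionId;
        final Phase phase;
        final String inputName;
        String value;

        Row(Long submissionId, Phase phase, String inputName, String value) {
            this.submissionId = submissionId;
            this.phase = phase;
            this.inputName = inputName;
            this.value = value;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    /** Edits only touch rows with a null submissionId, i.e. a run not yet started. */
    public void editPending(String inputName, String value) {
        for (Row r : rows) {
            if (r.submissionId == null && r.inputName.equals(inputName)) {
                r.value = value;
                return;
            }
        }
        rows.add(new Row(null, Phase.BEFORE, inputName, value));
    }

    /** Job run starts: pending rows become the BEFORE snapshot of this submission. */
    public void start(long submissionId) {
        for (Row r : rows) {
            if (r.submissionId == null) {
                r.submissionId = submissionId;
            }
        }
    }

    /** Job run finishes: the value is written again as the AFTER snapshot. */
    public void finish(long submissionId, String inputName, String value) {
        rows.add(new Row(submissionId, Phase.AFTER, inputName, value));
    }

    /** Returns the recorded value, or null if that snapshot does not exist. */
    public String lookup(long submissionId, Phase phase, String inputName) {
        for (Row r : rows) {
            if (r.submissionId != null && r.submissionId.longValue() == submissionId
                    && r.phase == phase && r.inputName.equals(inputName)) {
                return r.value;
            }
        }
        return null;
    }
}
```

With this shape, an edit made while submission 100 is still running creates a
separate pending row, so it cannot change what submission 100 recorded; it is
picked up as the BEFORE state of the next submission.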
Pros:
We have a history per job run that we can query.
We do not have race conditions on config input value edits, since every job run
has its own state.
Cons:
We will have a lot more entries in SQ_JOB_INPUT and SQ_LINK_INPUT than we
have now, but I see this as unavoidable if we need to provide easy
debuggability to the users on what inputs and values were used in every job
run, what values were edited, etc.
was:
As per SQOOP-1804, we will be storing both the config inputs and the
intermediate state generated as part of the job run in the config object.
Currently the config object is stored in the repository model under the
{code}SQ_CONFIG{code} table. It is per SQ_CONFIGURABLE.
The inputs within the Config class and their attributes are stored in
{code}SQ_INPUT{code}
i.e. the columns in SQ_INPUT map to the attributes of the config @Input
annotation:
{code}
@Input(size = 50)
public String schemaName;
@Input(size = 50)
public String tableName;
{code}
The actual values for the SQ_INPUT keys per Sqoop job are stored in
SQ_JOB_INPUT and SQ_LINK_INPUT.
So this means we overwrite the config input values for every job run. Let's
take an example.
If a job is started with the config value for key "test" set to "foo", then
during the first job run the value stored for that SQ_INPUT key is "foo". If
the value is modified to "bar" before the second run, the stored value becomes
"bar". If the user then queries the config values by job id, they will only
see the last modified value. This does not tell the user what the value was
before the job run started, or what it was when the job run / submission
ended.
The proposal is to provide this history so that the user can track per job run
the config input values.
A simple proposal is to add a submission_id to the SQ_JOB_INPUT and
SQ_LINK_INPUT tables.
[~anandriyer] also suggested we store the before/after config state if possible.
To support the BEFORE/AFTER config history:
1. We will create a new set of values for the config inputs for every job run,
based on the previous state. If the user edits the configs while the previous
job is running, we create new values with a null submissionId and associate
them with the submission id once the job run starts. Once the job run finishes,
we will write the config values again.
2. We will need to store the BEFORE/AFTER indicator in another column.
3. We will make only the last run config input values editable if the job has
not yet started.
Pros:
We have a history
> Input/State history per job run / submission
> --------------------------------------------
>
> Key: SQOOP-2025
> URL: https://issues.apache.org/jira/browse/SQOOP-2025
> Project: Sqoop
> Issue Type: Sub-task
> Reporter: Veena Basavaraj
> Assignee: Veena Basavaraj
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)