[
https://issues.apache.org/jira/browse/SQOOP-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248097#comment-14248097
]
Vinoth Chandar commented on SQOOP-1603:
---------------------------------------
Thanks for looping me in [~vybs].
Destroyer (or Terminator or Concluder or Finisher or Ender or ... :)) does not
seem like the right place to do any actual post-processing of the data set (as
in the Kite connector case). From what I understand, it is not a parallel stage
but is called once from the application master (?), much like
OutputCommitter.commitJob
(https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/OutputCommitter.html#commitJob%28org.apache.hadoop.mapred.JobContext%29).
How about providing the following:
0. The basic idea is to have the FROM connector run in the map stage and the TO
connector in the reduce stage.
1. Introduce a notion of partitions for the TO connectors as well.
2. Provide a means of translating a FROM connector's partition to a TO
partition. (Internally, for every record extracted, Sqoop would do a standard
MR 'emit' keyed on the corresponding TO partition, so the record reaches the
right Loader instance.)
(By default we could provide an identity mapping between FROM and TO partitions
and, in that case, run everything in the mapper itself like today, as an
optimization to avoid the unnecessary IPC.)
This would provide a means to organize data flexibly across the FROM partitions.
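A rough sketch of what such a TO-side hook could look like (ToPartition,
ToPartitioner and mapToPartition are made-up names for illustration, not
existing Sqoop 2 API):
{code}
// Hypothetical: a TO-side analogue of the FROM-side Partition. getKey() would
// drive the MR shuffle so that all records for one TO partition end up at the
// same Loader instance.
public class ToPartition {
  private final String key;

  public ToPartition(String key) {
    this.key = key;
  }

  public String getKey() {
    return key;
  }
}

// Hypothetical: invoked once per extracted record to pick a TO partition.
// A default identity implementation would keep today's map-only behaviour
// (FROM partition == TO partition, no shuffle).
public abstract class ToPartitioner<JobConfiguration> {
  public abstract ToPartition mapToPartition(JobConfiguration jobConfiguration,
                                             Object record);
}
{code}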
E.g.:
Consider a shopping cart database with orders, page_visits, and user tables,
where someone wants to extract records across these tables and then produce a
set of files per user or user group. The FROM partitions can be arbitrary
ranges of rows, as in the GenericJdbcConnector today, and each record then maps
to a TO partition for its user/user group.
Also consider use cases where we need to organize data based on some secondary
column in the source database, say grouping orders by country on the TO side.
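Continuing the hypothetical ToPartitioner sketch above, the country grouping
could be a connector-supplied partitioner (the record layout and column index
below are purely illustrative, with ToJobConfiguration standing in for the
connector's TO job config):
{code}
// Hypothetical connector-supplied partitioner: every order for one country is
// routed to one TO partition, i.e. one output file/dataset per country.
public class CountryToPartitioner extends ToPartitioner<ToJobConfiguration> {
  // Illustrative only: position of the "country" column in the extracted rows.
  private static final int COUNTRY_COLUMN = 3;

  @Override
  public ToPartition mapToPartition(ToJobConfiguration job, Object record) {
    // Assumes the intermediate record is exposed as an array of column values.
    Object[] fields = (Object[]) record;
    return new ToPartition(String.valueOf(fields[COUNTRY_COLUMN]));
  }
}
{code}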
> Sqoop2: Explicit support for Merge in the Sqoop Job lifecyle
> --------------------------------------------------------------
>
> Key: SQOOP-1603
> URL: https://issues.apache.org/jira/browse/SQOOP-1603
> Project: Sqoop
> Issue Type: Bug
> Reporter: Veena Basavaraj
> Assignee: Qian Xu
> Fix For: 1.99.5
>
>
> The Destroyer API and its javadoc:
> {code}
> /**
>  * This allows connector to define work to complete execution, for example,
>  * resource cleaning.
>  */
> public abstract class Destroyer<LinkConfiguration, JobConfiguration> {
>
>   /**
>    * Callback to clean up after job execution.
>    *
>    * @param context Destroyer context
>    * @param linkConfiguration link configuration object
>    * @param jobConfiguration job configuration object for the FROM and TO.
>    *        In case of the FROM destroyer this will represent the FROM job configuration.
>    *        In case of the TO destroyer this will represent the TO job configuration.
>    */
>   public abstract void destroy(DestroyerContext context,
>                                LinkConfiguration linkConfiguration,
>                                JobConfiguration jobConfiguration);
> }
> {code}
> This ticket was created while reviewing the Kite connector use case, where the
> destroyer does the actual temporary dataset merge:
> https://reviews.apache.org/r/26963/diff/# [~stanleyxu2005]
> {code}
> public void destroy(DestroyerContext context, LinkConfiguration link,
>     ToJobConfiguration job) {
>   LOG.info("Running Kite connector destroyer");
>   // Every loader instance creates a temporary dataset. If the MR job is
>   // successful, all temporary datasets should be merged into one dataset;
>   // otherwise they should all be deleted.
>   String[] uris = KiteDatasetExecutor.listTemporaryDatasetUris(
>       job.toDataset.uri);
>   if (context.isSuccess()) {
>     KiteDatasetExecutor executor = new KiteDatasetExecutor(job.toDataset.uri,
>         context.getSchema(), link.link.fileFormat);
>     for (String uri : uris) {
>       executor.mergeDataset(uri);
>       LOG.info(String.format("Temporary dataset %s merged", uri));
>     }
>   } else {
>     for (String uri : uris) {
>       KiteDatasetExecutor.deleteDataset(uri);
>       LOG.info(String.format("Temporary dataset %s deleted", uri));
>     }
>   }
> }
> {code}
> Wondering if such things should be their own phase rather than live in
> destroyers. The responsibility of the destroyer is, more precisely, to clean
> up and close both the FROM and TO data sources. Should operations that modify,
> merge, or munge records be their own step?