[
https://issues.apache.org/jira/browse/SQOOP-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281953#comment-14281953
]
Hari Shreedharan commented on SQOOP-1938:
-----------------------------------------
TL;DR: Parallelize reads and writes rather than running them sequentially.
Most of the threading machinery exists for a simple reason - each mapper does
I/O in two places: it writes to HDFS and it reads from the DB (at that time;
even extended to the new from/to architecture, you'd still have two I/O paths).
With a linear read-then-write loop, you are essentially not reading anything
while a write is in progress, which is pretty inefficient - you could easily
read while the write is happening by parallelizing the reads and writes, which
is what is being done. In addition, the output format does some extra
processing/handling, which costs time and CPU - time during which you could
instead be reading from the DB.
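To make the idea concrete, here is a minimal sketch (not Sqoop's actual code;
the class and method names are made up) of the pattern described above: a
reader thread stands in for the DB extractor and feeds a bounded queue, while
the calling thread stands in for the HDFS output format and drains it, so a
slow write no longer stalls the read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative producer-consumer split between the "read from DB" side and
// the "write to HDFS" side of a mapper. Names are hypothetical.
public class ParallelCopy {
  private static final String EOF = "__EOF__"; // sentinel marking end of input

  public static List<String> copy(List<String> dbRows) throws InterruptedException {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

    // Reader thread: stands in for the DB extractor.
    Thread reader = new Thread(() -> {
      try {
        for (String row : dbRows) {
          queue.put(row); // blocks only when the writer falls behind
        }
        queue.put(EOF);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    reader.start();

    // Writer side: stands in for the HDFS output format. While it is busy
    // handling one record, the reader thread is already fetching the next.
    List<String> written = new ArrayList<>();
    for (String row = queue.take(); !row.equals(EOF); row = queue.take()) {
      written.add(row); // real code would write the record to HDFS here
    }
    reader.join();
    return written;
  }
}
```

The bounded queue also provides back-pressure: if HDFS writes are slow, the
reader blocks on `put` instead of buffering the whole result set in memory.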
> DOC:update the sqoop MR engine implementation details
> -----------------------------------------------------
>
> Key: SQOOP-1938
> URL: https://issues.apache.org/jira/browse/SQOOP-1938
> Project: Sqoop
> Issue Type: Sub-task
> Reporter: Veena Basavaraj
> Assignee: Veena Basavaraj
> Fix For: 1.99.5
>
>
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+MR+Execution+Engine
> 1. Why we need SqoopWritable, what can be done in future?
> 2. Even though we call Sqoop a map-only job, is that how it always works? What
> happens when numLoaders is non-zero?
> {code}
>     // Set number of reducers as number of configured loaders or suppress
>     // reduce phase entirely if loaders are not set at all.
>     if (request.getLoaders() != null) {
>       job.setNumReduceTasks(request.getLoaders());
>     } else {
>       job.setNumReduceTasks(0);
>     }
> {code}
> 3. Internals of SqoopNullOutputFormat and how SqoopWritable is used in it
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)