[ 
https://issues.apache.org/jira/browse/SQOOP-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281953#comment-14281953
 ] 

Hari Shreedharan commented on SQOOP-1938:
-----------------------------------------

TL;DR: Parallelize reads and writes rather than have them be sequential.

Most of the threading magic exists for a simple reason: each mapper does I/O 
in two places. One is the read from the DB, the other is the write to HDFS 
(that was the case at the time; extended to the new from/to architecture, you 
still have two I/O paths). With linear read-then-write code, you read nothing 
while a write is in progress, which is inefficient. Parallelizing the reads and 
writes lets you keep reading while the write happens, which is what the code 
does. In addition, the output format does some processing/handling of its own, 
which costs time and CPU; that time can also be spent reading from the DB. 
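
To make the idea concrete, here is a minimal sketch of overlapping reads and writes with a bounded queue between a reader thread and a writer thread. It is not the actual Sqoop code: the class name ParallelCopySketch, the copy method, and the capacity parameter are all made up for illustration, the Iterator stands in for the DB read and the Consumer stands in for the HDFS write.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch, not Sqoop's implementation: the calling thread reads
// while a second thread writes, so neither side idles during the other's I/O.
public class ParallelCopySketch {
    // End-of-input marker, compared by identity and never passed to the sink.
    private static final Object END = new Object();

    // Copies every record from source to sink and returns the number written.
    // Records must be non-null because ArrayBlockingQueue rejects nulls.
    public static <T> long copy(Iterator<T> source, Consumer<T> sink, int capacity)
            throws InterruptedException, ExecutionException {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(capacity);
        ExecutorService writerPool = Executors.newSingleThreadExecutor();
        try {
            // Writer side: drain the queue until the END marker arrives.
            Callable<Long> writer = () -> {
                long count = 0;
                for (Object item = queue.take(); item != END; item = queue.take()) {
                    @SuppressWarnings("unchecked")
                    T record = (T) item;
                    sink.accept(record);
                    count++;
                }
                return count;
            };
            Future<Long> written = writerPool.submit(writer);

            // Reader side: keep reading while the writer is busy. The bounded
            // queue applies back-pressure if the reader gets too far ahead.
            while (source.hasNext()) {
                put(queue, source.next(), written);
            }
            put(queue, END, written);
            return written.get();
        } finally {
            writerPool.shutdownNow();
        }
    }

    // Enqueues with a timeout so a failed writer cannot block the reader forever.
    private static void put(BlockingQueue<Object> queue, Object item, Future<?> writer)
            throws InterruptedException, ExecutionException {
        while (!queue.offer(item, 100, TimeUnit.MILLISECONDS)) {
            if (writer.isDone()) {
                writer.get(); // rethrows the writer's failure, if any
                throw new IllegalStateException("writer stopped before end of input");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        long n = copy(Arrays.asList("a", "b", "c").iterator(),
                record -> System.out.println("wrote " + record), 2);
        System.out.println("copied " + n + " records");
    }
}
```

The queue capacity bounds how far the reader can run ahead, which keeps memory use flat when the write side is the slower one.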



> DOC:update the sqoop MR engine implementation details
> -----------------------------------------------------
>
>                 Key: SQOOP-1938
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1938
>             Project: Sqoop
>          Issue Type: Sub-task
>            Reporter: Veena Basavaraj
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.5
>
>
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+MR+Execution+Engine
> 1. Why do we need SqoopWritable, and what can be done in the future?
> 2. Even though we call Sqoop map-only, is that how it always works? What 
> happens when numLoaders is non-zero?
> {code}
>       // Set number of reducers as number of configured loaders  or suppress
>       // reduce phase entirely if loaders are not set at all.
>       if(request.getLoaders() != null) {
>         job.setNumReduceTasks(request.getLoaders());
>       } else {
>         job.setNumReduceTasks(0);
>       }
> {code}
> 3. Internals of SqoopNullOutputFormat and how SqoopWritable is used in it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
