[ 
https://issues.apache.org/jira/browse/SQOOP-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lan Jiang updated SQOOP-2904:
-----------------------------
    Summary: Oraoop does not distribute data evenly among mappers  (was: Oraoop 
does not distribute data evenly in mappers)

> Oraoop does not distribute data evenly among mappers
> ----------------------------------------------------
>
>                 Key: SQOOP-2904
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2904
>             Project: Sqoop
>          Issue Type: Bug
>          Components: connectors/oracle
>    Affects Versions: 1.4.6
>         Environment: RedHat 6.7 
>            Reporter: Lan Jiang
>
> When running the Sqoop command below with the --direct option to import data 
> from Oracle:
> sqoop import -Doracle.row.fetch.size=20000 -Doraoop.timestamp.string=false 
> --connect jdbc:oracle:thin:@xxx.xx.xx.xxxx -m 50 --direct --username xxx 
> --password xxxx --table my_table_name --fetch-size 20000 --target-dir 
> /data/temp --null-string '\\N' --null-non-string '\\N'
> The stdout output contains:
> 16/04/08 10:39:06 INFO oracle.OraOopDataDrivenDBInputFormat: The table being 
> imported by sqoop has 138310664 blocks that have been divided into 101 chunks 
> which will be processed in 50 splits. The chunks will be allocated to the 
> splits using the method : ROUNDROBIN
> 16/04/08 10:39:07 INFO mapreduce.JobSubmitter: number of splits:50
> Thus 49 mappers each work on 2 chunks while 1 mapper works on 3 chunks. 
> Because that single mapper processes 50% more data than each of the others, 
> it takes about 50% longer to finish.  
> First of all, OraoopUtilities.java has a method 
> getNumberOfDataChunksPerOracleDataFile:
>   public static int getNumberOfDataChunksPerOracleDataFile(
>       int desiredNumberOfMappers, org.apache.hadoop.conf.Configuration conf) {
>     final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
>     final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";
>     int numberToMultiplyMappersBy = conf.getInt(MAPPER_MULTIPLIER, 2);
>     int numberToIncrementResultBy = conf.getInt(RESULT_INCREMENT, 1);
>     // The number of chunks generated will *not* be a multiple of the number
>     // of splits, to ensure that each split doesn't always get data from the
>     // start of each data-file...
>     int numberOfDataChunksPerOracleDataFile =
>         (desiredNumberOfMappers * numberToMultiplyMappersBy)
>             + numberToIncrementResultBy;
> So it looks like this was designed on purpose, so that each split does not 
> always get data from the start of each data file. 
> I thought I could simply set the property 
> oraoop.datachunk.result.increment=0 to solve the issue, but testing showed 
> that it does not change the behavior. I then dug deeper and found that this 
> method is not actually called anywhere in Sqoop. Instead, the getSplits 
> method of class OraOopDataDrivenDBInputFormat implements similar logic 
> again, but this time with hard-coded values:
>     int desiredNumberOfMappers = getDesiredNumberOfMappers(jobContext);
>     …
>     ...
>       // The number of chunks generated will *not* be a multiple of the number
>       // of splits,
>       // to ensure that each split doesn't always get data from the start of
>       // each data-file...
>       int numberOfChunksPerOracleDataFile = (desiredNumberOfMappers * 2) + 1;
> Thus there is no way to change this behavior other than fixing the code.  
> The proposed fixes are:
> 1. Because the number of chunks is 2 * (number of mappers) + 1, data is 
> distributed unevenly across mappers, prolonging the whole Sqoop job by 
> 50%. IMHO, the benefit gained by ensuring that each split doesn't always get 
> data from the start of each data-file is insignificant compared to the 
> drawback of uneven distribution of data.
> 2. The getSplits method in OraOopDataDrivenDBInputFormat.java should call 
> getNumberOfDataChunksPerOracleDataFile in the OraoopUtilities class, so that 
> this behavior can be controlled by setting the 
> oraoop.datachunk.mapper.multiplier and oraoop.datachunk.result.increment 
> options.
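
To illustrate the arithmetic described in the report, here is a minimal, standalone sketch. It is not Sqoop code: the class ChunkBalanceSketch and its methods are hypothetical names, the round-robin rule (chunk i goes to split i % splits) is a simplified assumption about the ROUNDROBIN method, and it assumes a single data file with equally sized chunks, which matches the 101 chunks in the log above.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Sqoop.
class ChunkBalanceSketch {

    // Same formula as quoted in the report: (mappers * multiplier) + increment.
    static int chunksPerDataFile(int mappers, int multiplier, int increment) {
        return (mappers * multiplier) + increment;
    }

    // Assumed round-robin rule: chunk i is assigned to split (i % splits).
    // Returns how many chunks each split ends up with.
    static int[] chunksPerSplit(int chunks, int splits) {
        int[] counts = new int[splits];
        for (int i = 0; i < chunks; i++) {
            counts[i % splits]++;
        }
        return counts;
    }

    // Summarizes the lightest and heaviest split for a given setting.
    static String summarize(int mappers, int multiplier, int increment) {
        int chunks = chunksPerDataFile(mappers, multiplier, increment);
        int[] counts = chunksPerSplit(chunks, mappers);
        int min = Arrays.stream(counts).min().getAsInt();
        int max = Arrays.stream(counts).max().getAsInt();
        return chunks + " chunks, min " + min + ", max " + max + " per split";
    }

    public static void main(String[] args) {
        // Hard-coded behavior reported in this issue: multiplier 2, increment 1.
        System.out.println(summarize(50, 2, 1));
        // What increment = 0 would give if the option were honored.
        System.out.println(summarize(50, 2, 0));
    }
}
```

Under these assumptions, the first setting leaves one split with 3 chunks while the rest have 2, and the second gives every split exactly 2 chunks, which is the even distribution that proposed fix 2 would make configurable.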



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
