[ 
https://issues.apache.org/jira/browse/SQOOP-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lan Jiang updated SQOOP-2904:
-----------------------------
    Description: 
When executing the sqoop command below with the --direct option to import data 
from Oracle:

sqoop import -Doracle.row.fetch.size=20000 -Doraoop.timestamp.string=false 
--connect jdbc:oracle:thin:@xxx.xx.xx.xxxx -m 50 --direct --username xxx 
--password xxxx --table my_table_name --fetch-size 20000 --target-dir 
/data/temp 

The stdout message shows:

16/04/08 10:39:06 INFO oracle.OraOopDataDrivenDBInputFormat: The table being 
imported by sqoop has 138310664 blocks that have been divided into 101 chunks 
which will be processed in 50 splits. The chunks will be allocated to the 
splits using the method : ROUNDROBIN
16/04/08 10:39:07 INFO mapreduce.JobSubmitter: number of splits:50

Thus 49 mappers each work on 2 chunks while 1 mapper works on 3 chunks. Because 
that single mapper handles 50% more data than the rest of the mappers, it takes 
about 50% longer to finish.
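
The arithmetic behind that split can be checked with a small simulation. This 
is only a sketch, not Sqoop's code: it assumes round-robin means chunk i goes 
to split (i mod number of splits) and that all chunks are the same size. The 
class and method names are made up for illustration.

```java
import java.util.Arrays;

public class RoundRobinSkew {

    // Deals chunk i to split (i % numSplits) and returns the chunk count per split.
    // Simplified stand-in for round-robin allocation; not taken from Sqoop.
    static int[] chunksPerSplit(int numChunks, int numSplits) {
        int[] counts = new int[numSplits];
        for (int chunk = 0; chunk < numChunks; chunk++) {
            counts[chunk % numSplits]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Numbers from the log above: 101 chunks, 50 splits.
        int[] counts = chunksPerSplit(101, 50);
        int max = Arrays.stream(counts).max().getAsInt();
        int min = Arrays.stream(counts).min().getAsInt();
        long heavy = Arrays.stream(counts).filter(c -> c == max).count();
        System.out.println("min chunks per split = " + min);   // 2
        System.out.println("max chunks per split = " + max);   // 3
        System.out.println("splits with max load = " + heavy); // 1
    }
}
```

Under those assumptions the job's wall-clock time is set by the one split that 
holds 3 chunks, which is where the 50% figure comes from.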

First of all, OraoopUtilities.java has a method 
getNumberOfDataChunksPerOracleDataFile:

  public static int getNumberOfDataChunksPerOracleDataFile(
      int desiredNumberOfMappers, org.apache.hadoop.conf.Configuration conf) {
    final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
    final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";

    int numberToMultiplyMappersBy = conf.getInt(MAPPER_MULTIPLIER, 2);
    int numberToIncrementResultBy = conf.getInt(RESULT_INCREMENT, 1);

    // The number of chunks generated will *not* be a multiple of the number of
    // splits,
    // to ensure that each split doesn't always get data from the start of each
    // data-file...
    int numberOfDataChunksPerOracleDataFile =
        (desiredNumberOfMappers * numberToMultiplyMappersBy)
            + numberToIncrementResultBy;

So it looks like it was designed this way on purpose, so that each split will 
not always get data from the start of each data file.

I thought I could simply set the property oraoop.datachunk.result.increment=0 
to solve the issue, but testing showed that it does not change the behavior. 
I then dug deeper and found that this method is not actually called anywhere 
in Sqoop. Instead, the getSplits method of class OraOopDataDrivenDBInputFormat 
implements similar logic again, but this time using hard-coded values:

    int desiredNumberOfMappers = getDesiredNumberOfMappers(jobContext);
    ...

      // The number of chunks generated will *not* be a multiple of the number
      // of splits,
      // to ensure that each split doesn't always get data from the start of
      // each data-file...
      int numberOfChunksPerOracleDataFile = (desiredNumberOfMappers * 2) + 1;

Thus there is no way to change this behavior other than fixing the code.  

The proposed fixes are:

1. Because the number of chunks is 2 * (number of mappers) + 1, data is 
distributed unevenly across mappers, prolonging the whole Sqoop process by 
50%. IMHO, the benefit gained by ensuring that each split doesn't always get 
data from the start of each data-file is insignificant compared to the drawback 
of uneven distribution of data.
2. The getSplits method in OraOopDataDrivenDBInputFormat.java should call 
getNumberOfDataChunksPerOracleDataFile in the OraoopUtilities class so that 
this behavior can be controlled through the 
oraoop.datachunk.mapper.multiplier and oraoop.datachunk.result.increment options.
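
As a rough illustration of fix 2, the sketch below computes the chunk count 
from the two properties instead of hard-coding "* 2 + 1". It is not a patch: 
a plain Map stands in for org.apache.hadoop.conf.Configuration so the example 
runs without Hadoop, and the class name and helper methods are hypothetical. 
Only the two property names and their defaults (2 and 1) come from the code 
quoted above.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigurableChunkCount {

    static final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
    static final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";

    // Reads an int setting with a default; the Map is a stand-in for
    // Hadoop's Configuration (an assumption made to keep this self-contained).
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // Same formula as the quoted utility method, driven by configuration.
    static int chunksPerDataFile(int desiredNumberOfMappers, Map<String, String> conf) {
        int multiplier = getInt(conf, MAPPER_MULTIPLIER, 2);
        int increment = getInt(conf, RESULT_INCREMENT, 1);
        return desiredNumberOfMappers * multiplier + increment;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(chunksPerDataFile(50, conf)); // defaults: 50 * 2 + 1 = 101

        conf.put(RESULT_INCREMENT, "0");
        System.out.println(chunksPerDataFile(50, conf)); // 50 * 2 + 0 = 100
    }
}
```

With the increment set to 0, 50 mappers would get 100 chunks, which divides 
evenly at 2 chunks per mapper.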

  was:
When executing sqoop command below with direct option and import data from 
Oracle

sqoop import -Doracle.row.fetch.size=20000 -Doraoop.timestamp.string=false 
--connect jdbc:oracle:thin:@xxx.xx.xx.xxxx -m 50 --direct --username xxx 
--password xxxx --table my_table_name --fetch-size 20000 --target-dir 
/data/temp --null-string '\\N' --null-non-string '\\N'

The message stdout message has

16/04/08 10:39:06 INFO oracle.OraOopDataDrivenDBInputFormat: The table being 
imported by sqoop has 138310664 blocks that have been divided into 101 chunks 
which will be processed in 50 splits. The chunks will be allocated to the 
splits using the method : ROUNDROBIN
16/04/08 10:39:07 INFO mapreduce.JobSubmitter: number of splits:50

Thus 49 mapper is going to work on 2 chunks while 1 mapper is going to work on 
3 chunks. Because that single mapper takes 50% more data then rest of the 
mapper, it takes 50% longer time to finish.  

First of all, in the OraoopUtilities.java, it has a method 
getNumberOfDataChunksPerOracleDataFile

  public static int getNumberOfDataChunksPerOracleDataFile(
    
    int desiredNumberOfMappers, org.apache.hadoop.conf.Configuration conf) {
    final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
    final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";

    int numberToMultiplyMappersBy = conf.getInt(MAPPER_MULTIPLIER, 2);
    int numberToIncrementResultBy = conf.getInt(RESULT_INCREMENT, 1);

    // The number of chunks generated will *not* be a multiple of the number of
    // splits,
    // to ensure that each split doesn't always get data from the start of each
    // data-file...
    int numberOfDataChunksPerOracleDataFile =
        (desiredNumberOfMappers * numberToMultiplyMappersBy)
            + numberToIncrementResultBy;

So it looks like it was designed this way on purpose so that the each split 
will not always get data from the start of each data file. 

I thought I could simply configure property oraoop.datachunk.result.increment=0 
to solve the issue, but after testing, it seems it does not change the 
behavior. I then dig deeper and found this method is not actually called 
anywhere in the Sqoop. Instead, in class OraOopDataDrivenDBInputFormat (method 
getSplits), it implements the similar logic again, but this time using 
hard-coded values

    int desiredNumberOfMappers = getDesiredNumberOfMappers(jobContext);
    …
    ...

      // The number of chunks generated will *not* be a multiple of the number
      // of splits,
      // to ensure that each split doesn't always get data from the start of
      // each data-file...
      int numberOfChunksPerOracleDataFile = (desiredNumberOfMappers * 2) + 1;

Thus there is no way to change this behavior other than fixing the code.  

The proposed fixes are:

1. Because the number of chunk is 2* number of mappers + 1, it causes data to 
be distributed unevenly across mappers, prolonging the whole Sqoop process by 
50%. IMHO, the benefit gained by ensuring that each split doesn't always get 
data from the start of each data-file is insignificant compared to the drawback 
of uneven distribution of data.
2. The getSplits method in OraOopDataDrivenDBInputFormat.java should call 
OraoopUtilities class getNumberOfDataChunksPerOracleDataFile so that this 
behavior can be controlled by customization of 
oraoop.datachunk.mapper.multiplier and raoop.datachunk.result.increment options


> Oraoop does not distribute data evenly among mappers
> ----------------------------------------------------
>
>                 Key: SQOOP-2904
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2904
>             Project: Sqoop
>          Issue Type: Bug
>          Components: connectors/oracle
>    Affects Versions: 1.4.6
>         Environment: RedHat 6.7 
>            Reporter: Lan Jiang



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
