[jira] [Created] (SQOOP-2904) Oraoop does not distribute data evenly in mappers

Lan Jiang (JIRA) Sun, 10 Apr 2016 17:45:58 -0700

Lan Jiang created SQOOP-2904:
--------------------------------

             Summary: Oraoop does not distribute data evenly in mappers
                 Key: SQOOP-2904
                 URL: https://issues.apache.org/jira/browse/SQOOP-2904
             Project: Sqoop
          Issue Type: Bug
          Components: connectors/oracle
    Affects Versions: 1.4.6
         Environment: RedHat 6.7 
            Reporter: Lan Jiang



When running the Sqoop command below, which uses the --direct option to import 
data from Oracle:

sqoop import -Doracle.row.fetch.size=20000 -Doraoop.timestamp.string=false 
--connect jdbc:oracle:thin:@xxx.xx.xx.xxxx -m 50 --direct --username xxx 
--password xxxx --table my_table_name --fetch-size 20000 --target-dir 
/data/temp --null-string '\\N' --null-non-string '\\N'

The output on stdout includes:

16/04/08 10:39:06 INFO oracle.OraOopDataDrivenDBInputFormat: The table being 
imported by sqoop has 138310664 blocks that have been divided into 101 chunks 
which will be processed in 50 splits. The chunks will be allocated to the 
splits using the method : ROUNDROBIN
16/04/08 10:39:07 INFO mapreduce.JobSubmitter: number of splits:50

Thus 49 mappers each work on 2 chunks while 1 mapper works on 3 chunks. 
Because that single mapper processes 50% more data than the rest of the 
mappers, it takes about 50% longer to finish.
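
The arithmetic can be checked with a small sketch. It assumes that ROUNDROBIN 
means chunk i goes to split (i mod number of splits); the class and method 
names are hypothetical and not part of Sqoop.

```java
import java.util.Arrays;

// Hypothetical helper, not Sqoop code: counts how many chunks each split
// receives when chunk i is assigned to split (i % numberOfSplits).
class RoundRobinChunks {

    static int[] chunksPerSplit(int numberOfChunks, int numberOfSplits) {
        int[] counts = new int[numberOfSplits];
        for (int chunk = 0; chunk < numberOfChunks; chunk++) {
            counts[chunk % numberOfSplits]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Values taken from the log output above: 101 chunks, 50 splits.
        int[] counts = chunksPerSplit(101, 50);
        int max = Arrays.stream(counts).max().getAsInt();
        int min = Arrays.stream(counts).min().getAsInt();
        System.out.println("max chunks per split = " + max); // 3
        System.out.println("min chunks per split = " + min); // 2
    }
}
```

With 100 chunks instead of 101, every split would receive exactly 2 chunks.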

First of all, OraOopUtilities.java has a method 
getNumberOfDataChunksPerOracleDataFile:

  public static int getNumberOfDataChunksPerOracleDataFile(
      int desiredNumberOfMappers, org.apache.hadoop.conf.Configuration conf) {
    final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
    final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";

    int numberToMultiplyMappersBy = conf.getInt(MAPPER_MULTIPLIER, 2);
    int numberToIncrementResultBy = conf.getInt(RESULT_INCREMENT, 1);

    // The number of chunks generated will *not* be a multiple of the number of
    // splits,
    // to ensure that each split doesn't always get data from the start of each
    // data-file...
    int numberOfDataChunksPerOracleDataFile =
        (desiredNumberOfMappers * numberToMultiplyMappersBy)
            + numberToIncrementResultBy;

So it looks like it was designed this way on purpose, so that each split 
will not always get data from the start of each data file.
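
A rough sketch of that rationale follows. It assumes, without confirmation 
from the Sqoop source, that the chunks of all data files are numbered 
consecutively and allocated round-robin by that global number. The class and 
method names are hypothetical.

```java
// Hypothetical illustration, not Sqoop code. Assumes chunk j of data file f
// has global index (f * chunksPerDataFile + j) and is assigned to split
// (global index % numberOfSplits).
class DataFileStagger {

    // Split that receives the first chunk (j = 0) of the given data file.
    static int splitForFirstChunk(int dataFileIndex, int chunksPerDataFile,
                                  int numberOfSplits) {
        long globalIndex = (long) dataFileIndex * chunksPerDataFile;
        return (int) (globalIndex % numberOfSplits);
    }

    public static void main(String[] args) {
        int splits = 50;
        for (int file = 0; file < 4; file++) {
            // 2 * 50 + 1 = 101 chunks per file: first chunks go to splits 0, 1, 2, 3
            int staggered = splitForFirstChunk(file, 101, splits);
            // 2 * 50 = 100 chunks per file: first chunks always go to split 0
            int aligned = splitForFirstChunk(file, 100, splits);
            System.out.println("file " + file + ": staggered=" + staggered
                + " aligned=" + aligned);
        }
    }
}
```

Under this assumption the extra chunk only matters when a table spans several 
data files.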

I thought I could simply set the property oraoop.datachunk.result.increment=0 
to solve the issue, but testing showed that it does not change the 
behavior. I then dug deeper and found that this method is not actually called 
anywhere in Sqoop. Instead, the getSplits method of class 
OraOopDataDrivenDBInputFormat implements similar logic again, but this time 
with hard-coded values:

    int desiredNumberOfMappers = getDesiredNumberOfMappers(jobContext);
    ...

      // The number of chunks generated will *not* be a multiple of the number
      // of splits,
      // to ensure that each split doesn't always get data from the start of
      // each data-file...
      int numberOfChunksPerOracleDataFile = (desiredNumberOfMappers * 2) + 1;

Thus there is no way to change this behavior other than fixing the code.  

The proposed fixes are:

1. Reconsider the extra "+ 1" chunk. Because the number of chunks is 
2 * (number of mappers) + 1, data is distributed unevenly across mappers, 
prolonging the whole Sqoop process by 50%. IMHO, the benefit gained by 
ensuring that each split doesn't always get data from the start of each 
data-file is insignificant compared to the drawback of uneven distribution of 
data.
2. The getSplits method in OraOopDataDrivenDBInputFormat.java should call 
OraOopUtilities.getNumberOfDataChunksPerOracleDataFile so that this behavior 
can be controlled through the oraoop.datachunk.mapper.multiplier and 
oraoop.datachunk.result.increment options.
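
A minimal sketch of what fix 2 would make possible, not a patch. To stay 
self-contained it uses java.util.Properties as a stand-in for 
org.apache.hadoop.conf.Configuration; the class name and the getInt helper 
are hypothetical, while the two property names and their defaults (2 and 1) 
come from the method quoted above.

```java
import java.util.Properties;

// Hypothetical stand-alone sketch, not Sqoop code.
class ConfigurableChunkCount {

    static final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
    static final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";

    // Stand-in for Configuration.getInt(name, defaultValue).
    static int getInt(Properties conf, String name, int defaultValue) {
        String value = conf.getProperty(name);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // Same formula as the quoted utility method, reading both values
    // from the configuration instead of hard-coding 2 and 1.
    static int chunksPerDataFile(int desiredNumberOfMappers, Properties conf) {
        int multiplier = getInt(conf, MAPPER_MULTIPLIER, 2);
        int increment = getInt(conf, RESULT_INCREMENT, 1);
        return desiredNumberOfMappers * multiplier + increment;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        System.out.println(chunksPerDataFile(50, defaults)); // 101

        Properties tuned = new Properties();
        tuned.setProperty(RESULT_INCREMENT, "0");
        System.out.println(chunksPerDataFile(50, tuned)); // 100
    }
}
```

With the increment set to 0, 50 mappers would get 100 chunks, which divide 
evenly at 2 chunks per mapper.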



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
