[
https://issues.apache.org/jira/browse/SQOOP-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lan Jiang updated SQOOP-2904:
-----------------------------
Summary: Oraoop does not distribute data evenly among mappers (was: Oraoop
does not distribute data evenly in mappers)
> Oraoop does not distribute data evenly among mappers
> ----------------------------------------------------
>
> Key: SQOOP-2904
> URL: https://issues.apache.org/jira/browse/SQOOP-2904
> Project: Sqoop
> Issue Type: Bug
> Components: connectors/oracle
> Affects Versions: 1.4.6
> Environment: RedHat 6.7
> Reporter: Lan Jiang
>
> When running the Sqoop command below with the --direct option to import data
> from Oracle:
> sqoop import -Doracle.row.fetch.size=20000 -Doraoop.timestamp.string=false
> --connect jdbc:oracle:thin:@xxx.xx.xx.xxxx -m 50 --direct --username xxx
> --password xxxx --table my_table_name --fetch-size 20000 --target-dir
> /data/temp --null-string '\\N' --null-non-string '\\N'
> The output on stdout contains:
> 16/04/08 10:39:06 INFO oracle.OraOopDataDrivenDBInputFormat: The table being
> imported by sqoop has 138310664 blocks that have been divided into 101 chunks
> which will be processed in 50 splits. The chunks will be allocated to the
> splits using the method : ROUNDROBIN
> 16/04/08 10:39:07 INFO mapreduce.JobSubmitter: number of splits:50
> Thus 49 mappers each work on 2 chunks while 1 mapper works on 3 chunks.
> Because that single mapper handles 50% more data than each of the others, it
> takes about 50% longer to finish.
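The arithmetic above can be checked with a small sketch. It assumes a plain modulo round-robin (chunk i goes to split i % numSplits) and equally sized chunks; the actual OraOop allocation code may differ in detail. The class and method names are hypothetical and are not part of Sqoop.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Sqoop.
final class RoundRobinChunkSketch {

    // Assign chunk i to split (i % numSplits) and count the chunks per split.
    // This is an assumed, simplified form of round-robin allocation.
    static int[] chunksPerSplit(int numChunks, int numSplits) {
        int[] counts = new int[numSplits];
        for (int chunk = 0; chunk < numChunks; chunk++) {
            counts[chunk % numSplits]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Values taken from the log line above: 101 chunks, 50 splits.
        int[] counts = chunksPerSplit(101, 50);
        int min = Arrays.stream(counts).min().getAsInt();
        int max = Arrays.stream(counts).max().getAsInt();
        System.out.println("min chunks per split = " + min + ", max = " + max);
    }
}
```

With 100 chunks instead of 101, the same function gives every split exactly 2 chunks.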
> First of all, OraoopUtilities.java has a method
> getNumberOfDataChunksPerOracleDataFile:
> public static int getNumberOfDataChunksPerOracleDataFile(
>     int desiredNumberOfMappers, org.apache.hadoop.conf.Configuration conf) {
>   final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
>   final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";
>   int numberToMultiplyMappersBy = conf.getInt(MAPPER_MULTIPLIER, 2);
>   int numberToIncrementResultBy = conf.getInt(RESULT_INCREMENT, 1);
>   // The number of chunks generated will *not* be a multiple of the number
>   // of splits,
>   // to ensure that each split doesn't always get data from the start of
>   // each data-file...
>   int numberOfDataChunksPerOracleDataFile =
>       (desiredNumberOfMappers * numberToMultiplyMappersBy)
>       + numberToIncrementResultBy;
> So it looks like this was designed on purpose, so that each split does not
> always get data from the start of each data file.
> I thought I could simply set the property
> oraoop.datachunk.result.increment=0 to solve the issue, but testing showed
> that it does not change the behavior. I then dug deeper and found that this
> method is not actually called anywhere in Sqoop. Instead, the getSplits
> method of class OraOopDataDrivenDBInputFormat implements similar logic again,
> but this time with hard-coded values:
> int desiredNumberOfMappers = getDesiredNumberOfMappers(jobContext);
> ...
> // The number of chunks generated will *not* be a multiple of the number
> // of splits,
> // to ensure that each split doesn't always get data from the start of
> // each data-file...
> int numberOfChunksPerOracleDataFile = (desiredNumberOfMappers * 2) + 1;
> Thus there is no way to change this behavior other than fixing the code.
> The proposed fixes are:
> 1. Because the number of chunks is 2 * (number of mappers) + 1, data is
> distributed unevenly across mappers, prolonging the whole Sqoop process by
> 50%. IMHO, the benefit gained by ensuring that each split doesn't always get
> data from the start of each data-file is insignificant compared to the
> drawback of uneven distribution of data.
> 2. The getSplits method in OraOopDataDrivenDBInputFormat.java should call
> getNumberOfDataChunksPerOracleDataFile in the OraoopUtilities class so that
> this behavior can be controlled by setting the
> oraoop.datachunk.mapper.multiplier and oraoop.datachunk.result.increment
> options.
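The effect of the second proposal can be sketched outside Sqoop. This is a minimal illustration, not the Sqoop implementation: a plain Map stands in for org.apache.hadoop.conf.Configuration so the example runs without Hadoop, and the class name and the getInt helper are hypothetical. The property names and defaults (2 and 1) are those quoted in the report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Sqoop.
final class ChunkCountSketch {
    static final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
    static final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";

    // Stand-in for Configuration.getInt: return the default when the key is unset.
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // Same formula as the quoted method: mappers * multiplier + increment.
    static int chunksPerDataFile(int desiredNumberOfMappers, Map<String, String> conf) {
        int multiplier = getInt(conf, MAPPER_MULTIPLIER, 2);
        int increment = getInt(conf, RESULT_INCREMENT, 1);
        return desiredNumberOfMappers * multiplier + increment;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Defaults reproduce the hard-coded (mappers * 2) + 1.
        System.out.println("default: " + chunksPerDataFile(50, conf));
        // Setting the increment to 0 yields a multiple of the mapper count.
        conf.put(RESULT_INCREMENT, "0");
        System.out.println("increment=0: " + chunksPerDataFile(50, conf));
    }
}
```

If getSplits read the chunk count this way, passing -Doraoop.datachunk.result.increment=0 would make the count a multiple of the number of mappers.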