Lan Jiang created SQOOP-2904:
--------------------------------
Summary: Oraoop does not distribute data evenly in mappers
Key: SQOOP-2904
URL: https://issues.apache.org/jira/browse/SQOOP-2904
Project: Sqoop
Issue Type: Bug
Components: connectors/oracle
Affects Versions: 1.4.6
Environment: RedHat 6.7
Reporter: Lan Jiang
When running the sqoop command below with the --direct option to import data
from Oracle:
sqoop import -Doracle.row.fetch.size=20000 -Doraoop.timestamp.string=false
--connect jdbc:oracle:thin:@xxx.xx.xx.xxxx -m 50 --direct --username xxx
--password xxxx --table my_table_name --fetch-size 20000 --target-dir
/data/temp --null-string '\\N' --null-non-string '\\N'
The stdout output includes:
16/04/08 10:39:06 INFO oracle.OraOopDataDrivenDBInputFormat: The table being
imported by sqoop has 138310664 blocks that have been divided into 101 chunks
which will be processed in 50 splits. The chunks will be allocated to the
splits using the method : ROUNDROBIN
16/04/08 10:39:07 INFO mapreduce.JobSubmitter: number of splits:50
Thus 49 mappers each work on 2 chunks while 1 mapper works on 3 chunks.
Because that single mapper handles 50% more data than the rest of the
mappers, it takes about 50% longer to finish.
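The arithmetic above can be checked with a small standalone sketch. It is not
Sqoop code: the class and method names are hypothetical, and it assumes that
round-robin allocation simply assigns chunk i to split (i % numberOfSplits)
and that all chunks are about the same size.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Sqoop or OraOop.
class ChunkAllocationSketch {

    // Counts how many chunks each split receives when chunk i is handed to
    // split (i % numSplits). Assumed to approximate the ROUNDROBIN method.
    static int[] chunksPerSplit(int numChunks, int numSplits) {
        int[] counts = new int[numSplits];
        for (int i = 0; i < numChunks; i++) {
            counts[i % numSplits]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int mappers = 50;
        int chunks = (mappers * 2) + 1; // 101, as reported in the log above
        int[] counts = chunksPerSplit(chunks, mappers);
        int min = Arrays.stream(counts).min().getAsInt();
        int max = Arrays.stream(counts).max().getAsInt();
        // Prints: chunks=101 min=2 max=3
        System.out.println("chunks=" + chunks + " min=" + min + " max=" + max);
    }
}
```

With 100 chunks instead of 101, the same helper gives every split exactly 2
chunks.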
First of all, OraoopUtilities.java has a method
getNumberOfDataChunksPerOracleDataFile:
public static int getNumberOfDataChunksPerOracleDataFile(
int desiredNumberOfMappers, org.apache.hadoop.conf.Configuration conf) {
final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";
int numberToMultiplyMappersBy = conf.getInt(MAPPER_MULTIPLIER, 2);
int numberToIncrementResultBy = conf.getInt(RESULT_INCREMENT, 1);
// The number of chunks generated will *not* be a multiple of the number of
// splits,
// to ensure that each split doesn't always get data from the start of each
// data-file...
int numberOfDataChunksPerOracleDataFile =
(desiredNumberOfMappers * numberToMultiplyMappersBy)
+ numberToIncrementResultBy;
So it looks like it was designed this way on purpose, so that each split
will not always get data from the start of each data file.
I thought I could simply set the property oraoop.datachunk.result.increment=0
to solve the issue, but after testing, it does not change the behavior. I
then dug deeper and found that this method is not actually called anywhere in
Sqoop. Instead, the getSplits method of class OraOopDataDrivenDBInputFormat
implements similar logic again, but this time using hard-coded values:
int desiredNumberOfMappers = getDesiredNumberOfMappers(jobContext);
…
...
// The number of chunks generated will *not* be a multiple of the number
// of splits,
// to ensure that each split doesn't always get data from the start of
// each data-file...
int numberOfChunksPerOracleDataFile = (desiredNumberOfMappers * 2) + 1;
Thus there is no way to change this behavior other than fixing the code.
The proposed fixes are:
1. Because the number of chunks is 2 * number of mappers + 1, data is
distributed unevenly across mappers, prolonging the whole Sqoop job by about
50%. IMHO, the benefit gained by ensuring that each split doesn't always get
data from the start of each data file is insignificant compared to the
drawback of uneven distribution of data.
2. The getSplits method in OraOopDataDrivenDBInputFormat.java should call
OraoopUtilities.getNumberOfDataChunksPerOracleDataFile so that this behavior
can be controlled through the oraoop.datachunk.mapper.multiplier and
oraoop.datachunk.result.increment options.
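As a rough sketch of what fix 2 would allow, the snippet below computes the
chunk count from the two property names quoted above, with the same defaults
(2 and 1). It is not the Sqoop implementation: java.util.Properties stands in
for org.apache.hadoop.conf.Configuration so the example runs without Hadoop,
and the class and method names are hypothetical.

```java
import java.util.Properties;

// Hypothetical illustration only; Properties stands in for Hadoop's Configuration.
class ConfigurableChunkCountSketch {

    static final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
    static final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";

    // Same formula as the quoted method: (mappers * multiplier) + increment,
    // with both factors read from configuration instead of being hard-coded.
    static int chunksPerDataFile(int desiredNumberOfMappers, Properties conf) {
        int multiplier = Integer.parseInt(conf.getProperty(MAPPER_MULTIPLIER, "2"));
        int increment = Integer.parseInt(conf.getProperty(RESULT_INCREMENT, "1"));
        return (desiredNumberOfMappers * multiplier) + increment;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        // Prints 101: not a multiple of 50, so one split gets an extra chunk.
        System.out.println(chunksPerDataFile(50, defaults));

        Properties tuned = new Properties();
        tuned.setProperty(RESULT_INCREMENT, "0");
        // Prints 100: a multiple of 50, so chunks can be spread evenly.
        System.out.println(chunksPerDataFile(50, tuned));
    }
}
```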