[ https://issues.apache.org/jira/browse/SQOOP-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854178#comment-15854178 ]
Markus Kemper commented on SQOOP-3130:
--------------------------------------

From reviewing this issue further, another possible workaround is to set (mapreduce.input.fileinputformat.split.minsize) to help control the number of map tasks used when exporting Avro data files. This method is not likely to be 100% deterministic if the volume of data is larger than the bounds implied by the "minsize" and the requested number of map tasks; however, it is still preferable to one map task per Avro data file. The examples below demonstrate this workaround.

*Use (mapreduce.input.fileinputformat.split.minsize) without \--num-mappers*
{noformat}
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export -Dmapreduce.input.fileinputformat.split.minsize=134217728 --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ','

17/02/06 05:56:25 INFO input.FileInputFormat: Total input paths to process : 10
17/02/06 05:56:25 INFO mapreduce.JobSubmitter: number of splits:4    <=========== uses default --num-mappers value (4)
<SNIP>
17/02/06 05:57:11 INFO mapreduce.ExportJobBase: Transferred 3.519 MB in 47.9893 seconds (75.0887 KB/sec)
17/02/06 05:57:11 INFO mapreduce.ExportJobBase: Exported 100000 records.
{noformat}

*Use (mapreduce.input.fileinputformat.split.minsize) with \--num-mappers*
{noformat}
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export -Dmapreduce.input.fileinputformat.split.minsize=134217728 --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ',' --num-mappers 2

17/02/06 05:59:12 INFO input.FileInputFormat: Total input paths to process : 10
17/02/06 05:59:12 INFO mapreduce.JobSubmitter: number of splits:2    <=========== uses requested --num-mappers value (2)
<SNIP>
17/02/06 05:59:42 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in 32.3481 seconds (111.3925 KB/sec)
17/02/06 05:59:42 INFO mapreduce.ExportJobBase: Exported 100000 records.
{noformat}
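One rough way to pick a value for this property is to derive it from the total size of the export directory and the desired number of map tasks (the 134217728 used above is simply 128 MB). The snippet below is only a sketch under that assumption; the target of 2 mappers is illustrative and not part of the recorded runs.
{noformat}
# Sizing sketch: aim for roughly (total input bytes / desired map tasks) as the minimum split size.
TOTAL_BYTES=$(hdfs dfs -du -s /user/root/external/t1_text | awk '{print $1}')
DESIRED_MAPPERS=2
MINSIZE=$(( TOTAL_BYTES / DESIRED_MAPPERS ))
echo "total=$TOTAL_BYTES bytes, split.minsize=$MINSIZE"

sqoop export -Dmapreduce.input.fileinputformat.split.minsize=$MINSIZE \
  --connect $MYCONN --username $MYUSER --password $MYPSWD \
  --table T1_TEXT --export-dir /user/root/external/t1_text \
  --input-fields-terminated-by ',' --num-mappers $DESIRED_MAPPERS
{noformat}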
Sqoop (export + --export-dir + Avro files) is not obeying --num-mappers requested
---------------------------------------------------------------------------------

                Key: SQOOP-3130
                URL: https://issues.apache.org/jira/browse/SQOOP-3130
            Project: Sqoop
         Issue Type: Bug
           Reporter: Markus Kemper

When using Sqoop (export + --export-dir + Avro files) it is not obeying the --num-mappers requested; instead, it is creating a map task per Avro data file. One known workaround for this issue is to use the Sqoop --hcatalog options. Please see the test case below demonstrating the issue and workaround.

*Test Case*
{noformat}
#################
# STEP 01 - Create Data
#################

for i in {1..100000}; do d=`date +"%Y-%m-%d %H:%M:%S" --date="+$i days"`; echo "$i,$d,row data" >> ./data.csv; done
ls -l ./*; wc data.csv
hdfs dfs -mkdir -p /user/root/external/t1
hdfs dfs -put ./data.csv /user/root/external/t1/data.csv
hdfs dfs -ls -R /user/root/external/t1/

Output:
-rw-r--r-- 1 root root 3488895 Feb 1 11:20 ./data.csv
~~~~~
100000 300000 3488895 data.csv
~~~~~
-rw-r--r-- 3 root root 3488895 2017-02-01 11:26 /user/root/external/t1/data.csv

#################
# STEP 02 - Create RDBMS Table and Export Data
#################

export MYCONN=jdbc:oracle:thin:@oracle.cloudera.com:1521/db11g;
export MYUSER=sqoop
export MYPSWD=cloudera
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1_text (c1 int, c2 date, c3 varchar(10))"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1 --input-fields-terminated-by ',' --num-mappers 1
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Transferred 3.3274 MB in 24.8037 seconds (137.3688 KB/sec)
17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~~
------------------------
| COUNT(*)             |
------------------------
| 100000               |
------------------------

#################
# STEP 03 - Import Data as Text Creating 10 HDFS Files
#################

sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_text --delete-target-dir --num-mappers 10 --split-by C1 --as-textfile
hdfs dfs -ls /user/root/external/t1_text/part*

Output:
17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Transferred 3.518 MB in 57.0517 seconds (63.1434 KB/sec)
17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
~~~~~
-rw-r--r-- 3 root root 358894 2017-02-01 11:38 /user/root/external/t1_text/part-m-00000
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00001
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00002
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00003
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00004
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00005
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00006
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00007
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00008
-rw-r--r-- 3 root root 370001 2017-02-01 11:38 /user/root/external/t1_text/part-m-00009
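
# Optional sanity check (an illustrative sketch, not part of the recorded run):
# confirm the 10 text files together still hold all 100000 rows before exporting.
hdfs dfs -cat /user/root/external/t1_text/part-m-* | wc -l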

#################
# STEP 04 - Export 10 Text Formatted Data Using 2 Splits
#################

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ',' --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
------------------------
| COUNT(*)             |
------------------------
| 0                    |
------------------------
~~~~~
17/02/01 11:47:26 INFO input.FileInputFormat: Total input paths to process : 10
17/02/01 11:47:26 INFO mapreduce.JobSubmitter: number of splits:2
<SNIP>
17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in 31.7104 seconds (113.6324 KB/sec)
17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             |
------------------------
| 100000               |
------------------------

#################
# STEP 05 - Import Data as Avro Creating 10 HDFS Files
#################

sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_avro --delete-target-dir --num-mappers 10 --split-by C1 --as-avrodatafile
hdfs dfs -ls /user/root/external/t1_avro/*.avro

Output:
17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Transferred 2.3703 MB in 68.454 seconds (35.4568 KB/sec)
17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
~~~~~
-rw-r--r-- 3 root root 231119 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00000.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00001.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00002.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00003.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00004.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00005.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00006.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00007.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00008.avro
-rw-r--r-- 3 root root 250478 2017-02-01 11:56 /user/root/external/t1_avro/part-m-00009.avro
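
# Optional inspection (an illustrative sketch, not part of the recorded run):
# dump the Avro schema Sqoop generated for these files. Assumes the avro-tools
# CLI is available; the local copy path is only an example.
hdfs dfs -get /user/root/external/t1_avro/part-m-00000.avro /tmp/part-m-00000.avro
avro-tools getschema /tmp/part-m-00000.avro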

#################
# STEP 06 - Export 10 Avro Formatted Data Using 2 Splits (reproduction of issue)
#################

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_avro --input-fields-terminated-by ',' --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
17/02/01 12:01:07 INFO input.FileInputFormat: Total input paths to process : 10
17/02/01 12:01:08 INFO mapreduce.JobSubmitter: number of splits:10    <================== not correct, should be only 2 not 10
<SNIP>
17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Transferred 2.4497 MB in 57.1965 seconds (43.8575 KB/sec)
17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             |
------------------------
| 100000               |
------------------------

#################
# STEP 07 - Export 10 Avro Formatted Data Using 2 Splits Using HCat Options (workaround)
#################

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
beeline -u "jdbc:hive2://hs2.coe.cloudera.com:10000" -n user1 -e "use default; drop table t1_avro; create external table t1_avro (c1 int, c2 string, c3 string) row format delimited fields terminated by ',' stored as avro location 'hdfs:///user/root/external/t1_avro'; select count(*) from t1_avro;"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --hcatalog-database default --hcatalog-table t1_avro --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
------------------------
| COUNT(*)             |
------------------------
| 0                    |
------------------------
~~~~~
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
~~~~~
17/02/01 13:41:54 INFO mapred.FileInputFormat: Total input paths to process : 10
17/02/01 13:41:54 INFO mapreduce.JobSubmitter: number of splits:2    <================ correct!
<SNIP>
17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Transferred 2.5225 MB in 48.7286 seconds (53.0082 KB/sec)
17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             |
------------------------
| 100000               |
------------------------
{noformat}
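
As an additional check after either workaround, one might confirm both the row count and the key range that landed in the RDBMS table; this is only a sketch, and the min/max query assumes the C1 column from the DDL in STEP 02.
{noformat}
# Verification sketch: compare the exported row count and key range against the source data.
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*), min(c1), max(c1) from t1_text"
{noformat}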