[ https://issues.apache.org/jira/browse/SQOOP-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854178#comment-15854178 ]

Markus Kemper commented on SQOOP-3130:
--------------------------------------

From reviewing this issue further, another possible workaround is to set
(mapreduce.input.fileinputformat.split.minsize) to help control the number of
map tasks used when exporting Avro data files. This method is not likely to be
100% deterministic if the volume of data is larger than the boundaries set by
the "minsize" and the requested number of map tasks, but it is preferable to
one map task per Avro data file. The examples below demonstrate this
workaround.
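
Following that reasoning, one way to sanity-check a chosen value up front is to
compare the total size of the export directory against (minsize x requested map
tasks). The sketch below is only an estimate and reuses the directory, minsize,
and map count from the examples that follow; the actual split boundaries still
depend on the file layout.
{noformat}
# Rough check: if the total input size exceeds minsize x requested map tasks,
# additional map tasks may still be launched despite the requested count.
TOTAL_BYTES=$(hdfs dfs -du -s /user/root/external/t1_text | awk '{print $1}')
MINSIZE=134217728
REQUESTED_MAPS=2
if [ "$TOTAL_BYTES" -le $(( MINSIZE * REQUESTED_MAPS )) ]; then
  echo "Input ($TOTAL_BYTES bytes) fits within minsize x maps; the requested map count is likely to hold."
else
  echo "Input ($TOTAL_BYTES bytes) exceeds minsize x maps; more map tasks may be used."
fi
{noformat}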

*Use (mapreduce.input.fileinputformat.split.minsize) without \--num-mappers*
{noformat}
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export -Dmapreduce.input.fileinputformat.split.minsize=134217728 --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ','

17/02/06 05:56:25 INFO input.FileInputFormat: Total input paths to process : 10
17/02/06 05:56:25 INFO mapreduce.JobSubmitter: number of splits:4 <=========== uses default --num-mappers value (4)
<SNIP>
17/02/06 05:57:11 INFO mapreduce.ExportJobBase: Transferred 3.519 MB in 47.9893 seconds (75.0887 KB/sec)
17/02/06 05:57:11 INFO mapreduce.ExportJobBase: Exported 100000 records.
{noformat}

*Use (mapreduce.input.fileinputformat.split.minsize) with \--num-mappers*
{noformat}
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export -Dmapreduce.input.fileinputformat.split.minsize=134217728 --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ',' --num-mappers 2

17/02/06 05:59:12 INFO input.FileInputFormat: Total input paths to process : 10
17/02/06 05:59:12 INFO mapreduce.JobSubmitter: number of splits:2 <=========== uses requested --num-mappers value (2)
<SNIP>
17/02/06 05:59:42 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in 32.3481 seconds (111.3925 KB/sec)
17/02/06 05:59:42 INFO mapreduce.ExportJobBase: Exported 100000 records.
{noformat}
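
To verify after the fact how many map tasks an export actually launched, the
job's counters can also be queried. This is only a sketch; the job id shown is
a placeholder, so substitute the id printed in the Sqoop/MapReduce output.
{noformat}
# TOTAL_LAUNCHED_MAPS is the number of map tasks the job started.
# job_1486000000000_0001 is a placeholder job id.
mapred job -counter job_1486000000000_0001 org.apache.hadoop.mapreduce.JobCounter TOTAL_LAUNCHED_MAPS
{noformat}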

> Sqoop (export + --export-dir + Avro files) is not obeying --num-mappers 
> requested
> ---------------------------------------------------------------------------------
>
>                 Key: SQOOP-3130
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3130
>             Project: Sqoop
>          Issue Type: Bug
>            Reporter: Markus Kemper
>
> When using Sqoop (export + --export-dir + Avro files), it does not obey the
> requested --num-mappers value; instead it creates one map task per Avro data
> file.
> One known workaround for this issue is to use the Sqoop --hcatalog options.
> Please see the test case below demonstrating the issue and the workaround.
> *Test Case*
> {noformat}
> #################
> # STEP 01 - Create Data
> #################
> for i in {1..100000}; do d=`date +"%Y-%m-%d %H:%M:%S" --date="+$i days"`; 
> echo "$i,$d,row data" >> ./data.csv; done
> ls -l ./*;
> wc data.csv 
> hdfs dfs -mkdir -p /user/root/external/t1
> hdfs dfs -put ./data.csv /user/root/external/t1/data.csv
> hdfs dfs -ls -R /user/root/external/t1/
> Output:
> -rw-r--r-- 1 root root 3488895 Feb  1 11:20 ./data.csv
> ~~~~~
> 100000  300000 3488895 data.csv
> ~~~~~
> -rw-r--r--   3 root root    3488895 2017-02-01 11:26 
> /user/root/external/t1/data.csv
> #################
> # STEP 02 - Create RDBMS Table and Export Data
> #################
> export MYCONN=jdbc:oracle:thin:@oracle.cloudera.com:1521/db11g;
> export MYUSER=sqoop
> export MYPSWD=cloudera
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "drop table t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "create table t1_text (c1 int, c2 date, c3 varchar(10))"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table 
> T1_TEXT --export-dir /user/root/external/t1 --input-fields-terminated-by ',' 
> --num-mappers 1
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "select count(*) from t1_text"
> Output:
> 17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Transferred 3.3274 MB in 
> 24.8037 seconds (137.3688 KB/sec)
> 17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~~
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 100000               | 
> ------------------------
> #################
> # STEP 03 - Import Data as Text Creating 10 HDFS Files
> #################
> sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table 
> T1_TEXT --target-dir /user/root/external/t1_text --delete-target-dir 
> --num-mappers 10 --split-by C1 --as-textfile
> hdfs dfs -ls /user/root/external/t1_text/part*
> Output:
> 17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Transferred 3.518 MB in 
> 57.0517 seconds (63.1434 KB/sec)
> 17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
> ~~~~~
> -rw-r--r--   3 root root     358894 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00000
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00001
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00002
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00003
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00004
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00005
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00006
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00007
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00008
> -rw-r--r--   3 root root     370001 2017-02-01 11:38 
> /user/root/external/t1_text/part-m-00009
> #################
> # STEP 04 - Export 10 Text Formatted Data Files Using 2 Splits
> #################
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "delete from t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "select count(*) from t1_text"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table 
> T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by 
> ',' --num-mappers 2
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "select count(*) from t1_text"
> Output:
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 0                    | 
> ------------------------
> ~~~~~
> 17/02/01 11:47:26 INFO input.FileInputFormat: Total input paths to process : 
> 10
> 17/02/01 11:47:26 INFO mapreduce.JobSubmitter: number of splits:2
> <SNIP>
> 17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in 
> 31.7104 seconds (113.6324 KB/sec)
> 17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 100000               | 
> ------------------------
> #################
> # STEP 05 - Import Data as Avro Creating 10 HDFS Files
> #################
> sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table 
> T1_TEXT --target-dir /user/root/external/t1_avro --delete-target-dir 
> --num-mappers 10 --split-by C1 --as-avrodatafile
> hdfs dfs -ls /user/root/external/t1_avro/*.avro
> Output:
> 17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Transferred 2.3703 MB in 
> 68.454 seconds (35.4568 KB/sec)
> 17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
> ~~~~~
> -rw-r--r--   3 root root     231119 2017-02-01 11:57 
> /user/root/external/t1_avro/part-m-00000.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 
> /user/root/external/t1_avro/part-m-00001.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 
> /user/root/external/t1_avro/part-m-00002.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 
> /user/root/external/t1_avro/part-m-00003.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 
> /user/root/external/t1_avro/part-m-00004.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 
> /user/root/external/t1_avro/part-m-00005.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 
> /user/root/external/t1_avro/part-m-00006.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 
> /user/root/external/t1_avro/part-m-00007.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 
> /user/root/external/t1_avro/part-m-00008.avro
> -rw-r--r--   3 root root     250478 2017-02-01 11:56 
> /user/root/external/t1_avro/part-m-00009.avro
> #################
> # STEP 06 - Export 10 Avro Formatted Data Files Using 2 Splits (reproduction of the issue)
> #################
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "delete from t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "select count(*) from t1_text"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table 
> T1_TEXT --export-dir /user/root/external/t1_avro --input-fields-terminated-by 
> ',' --num-mappers 2
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "select count(*) from t1_text"
> Output:
> 17/02/01 12:01:07 INFO input.FileInputFormat: Total input paths to process : 
> 10
> 17/02/01 12:01:08 INFO mapreduce.JobSubmitter: number of splits:10 
> <================== not correct, should be only 2 not 10
> <SNIP>
> 17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Transferred 2.4497 MB in 
> 57.1965 seconds (43.8575 KB/sec)
> 17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 100000               | 
> ------------------------
> #################
> # STEP 07 - Export 10 Avro Formatted Data Files Using 2 Splits with HCat Options (workaround)
> #################
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "delete from t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "select count(*) from t1_text"
> beeline -u "jdbc:hive2://hs2.coe.cloudera.com:10000" -n user1 -e "use 
> default; drop table t1_avro; create external table t1_avro (c1 int, c2 
> string, c3 string) row format delimited fields terminated by ',' stored as 
> avro location 'hdfs:///user/root/external/t1_avro'; select count(*) from 
> t1_avro;"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table 
> T1_TEXT --hcatalog-database default --hcatalog-table t1_avro --num-mappers 2  
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query 
> "select count(*) from t1_text"
> Output:
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 0                    | 
> ------------------------
> ~~~~~
> +---------+--+
> |   _c0   |
> +---------+--+
> | 100000  |
> +---------+--+
> ~~~~~
> 17/02/01 13:41:54 INFO mapred.FileInputFormat: Total input paths to process : 
> 10
> 17/02/01 13:41:54 INFO mapreduce.JobSubmitter: number of splits:2 
> <================ correct!
> <SNIP>
> 17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Transferred 2.5225 MB in 
> 48.7286 seconds (53.0082 KB/sec)
> 17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 100000               | 
> ------------------------
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
