[ https://issues.apache.org/jira/browse/SQOOP-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Szabolcs Vasas reassigned SQOOP-3068:
-------------------------------------

    Assignee: Szabolcs Vasas

> Enhance error (tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema) to suggest workaround (--map-column-java)
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SQOOP-3068
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3068
>             Project: Sqoop
>          Issue Type: Improvement
>          Components: hive-integration
>    Affects Versions: 1.4.6
>            Reporter: Markus Kemper
>            Assignee: Szabolcs Vasas
>
> Please consider enhancing the error to include more detail and to suggest the workaround (--map-column-java).
> Sqoop (import + --hive-import + --as-parquetfile) can fail due to a mismatch between the JSON schema that Hive produces and the JSON schema that Sqoop generates. The test case below demonstrates how to reproduce the issue as well as how to work around it.
> SETUP (create parquet table with Sqoop import and Beeline CTAS)
> {noformat}
> STEP 01 - Create MySQL Table
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1 (c_int int, c_date date, c_timestamp timestamp)"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "describe t1"
> ---------------------------------------------------------------------------------------------------------
> | Field        | Type      | Null | Key | Default           | Extra                       |
> ---------------------------------------------------------------------------------------------------------
> | c_int        | int(11)   | YES  |     | (null)            |                             |
> | c_date       | date      | YES  |     | (null)            |                             |
> | c_timestamp  | timestamp | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
> ---------------------------------------------------------------------------------------------------------
>
> STEP 02 - Insert and Select Row
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "insert into t1 values (1, current_date(), current_timestamp())"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select * from t1"
> --------------------------------------------------
> | c_int | c_date     | c_timestamp            |
> --------------------------------------------------
> | 1     | 2016-10-26 | 2016-10-26 14:30:33.0  |
> --------------------------------------------------
>
> STEP 03 - Create Hive Tables
> beeline -u jdbc:hive2:// -e "use default; drop table t1"
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hcatalog-database default --hcatalog-table t1_text --create-hcatalog-table --hcatalog-storage-stanza 'stored as parquet' --num-mappers 1
> beeline -u jdbc:hive2:// -e "use default; create table t1 stored as parquet as select * from t1_text; show create table t1;"
> +----------------------------------------------------+--+
> |                   createtab_stmt                   |
> +----------------------------------------------------+--+
> | CREATE TABLE `t1`(                                 |
> |   `c_int` int,                                     |
> |   `c_date` string,                                 |
> |   `c_timestamp` string)                            |
> | ROW FORMAT SERDE                                   |
> |   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
> | STORED AS INPUTFORMAT                              |
> |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' |
> | OUTPUTFORMAT                                       |
> |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
> | LOCATION                                           |
> |   'hdfs://<namenode>:8020/user/hive/warehouse/t1'  |
> | TBLPROPERTIES (                                    |
> |   'COLUMN_STATS_ACCURATE'='true',                  |
> |   'numFiles'='1',                                  |
> |   'numRows'='2',                                   |
> |   'rawDataSize'='6',                               |
> |   'totalSize'='605',                               |
> |   'transient_lastDdlTime'='1478298298')            |
> +----------------------------------------------------+--+
> {noformat}
> REPRODUCE ISSUE (import --hive-import append and overwrite)
> {noformat}
> STEP 01 - Attempt --hive-import --append
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --append
> 16/11/02 20:54:23 WARN mapreduce.DataDrivenImportJob: Target Hive table 't1' exists! Sqoop will append data into the existing Hive table. Consider using --hive-overwrite, if you do NOT intend to do appending.
> 16/11/02 20:54:24 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema: {"type":"record","name":"t1","fields":[{"name":"c_int","type":["null","int"],"doc":"Converted from 'int'","default":null},{"name":"c_date","type":["null","string"],"doc":"Converted from 'string'","default":null},{"name":"c_timestamp","type":["null","string"],"doc":"Converted from 'string'","default":null}]}
> Actual schema: {"type":"record","name":"t1","doc":"Sqoop import of t1","fields":[{"name":"c_int","type":["null","int"],"default":null,"columnName":"c_int","sqlType":"4"},{"name":"c_date","type":["null","long"],"default":null,"columnName":"c_date","sqlType":"91"},{"name":"c_timestamp","type":["null","long"],"default":null,"columnName":"c_timestamp","sqlType":"93"}],"tableName":"t1"}
>
> STEP 02 - Attempt --hive-import --hive-overwrite
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --hive-overwrite
> 16/11/02 20:56:55 INFO hive.metastore: Connected to metastore.
> 16/11/02 20:56:56 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema: {"type":"record","name":"t1","fields":[{"name":"c_int","type":["null","int"],"doc":"Converted from 'int'","default":null},{"name":"c_date","type":["null","string"],"doc":"Converted from 'string'","default":null},{"name":"c_timestamp","type":["null","string"],"doc":"Converted from 'string'","default":null}]}
> Actual schema: {"type":"record","name":"t1","doc":"Sqoop import of t1","fields":[{"name":"c_int","type":["null","int"],"default":null,"columnName":"c_int","sqlType":"4"},{"name":"c_date","type":["null","long"],"default":null,"columnName":"c_date","sqlType":"91"},{"name":"c_timestamp","type":["null","long"],"default":null,"columnName":"c_timestamp","sqlType":"93"}],"tableName":"t1"}
> {noformat}
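>
> WORKAROUND (sketch, untested here): the two schemas above differ only in c_date and c_timestamp, which are string in the Hive-created table but long in Sqoop's generated schema, so mapping those two columns to Java String with --map-column-java should make the generated schema match. The c_date=String,c_timestamp=String mappings below are inferred from the expected/actual schemas in the error output, not taken from a verified run:
> {noformat}
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --append --map-column-java c_date=String,c_timestamp=String
> {noformat}
> Including a hint along these lines in the ImportTool error text would spare users from diffing the two JSON schemas by hand.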
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)