[ 
https://issues.apache.org/jira/browse/IMPALA-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang resolved IMPALA-11120.
-------------------------------------
    Fix Version/s: Impala 4.1.0
       Resolution: Fixed

> load-data.py does not load ORC files with specified codec
> ---------------------------------------------------------
>
>                 Key: IMPALA-11120
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11120
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>             Fix For: Impala 4.1.0
>
>
> I ran the following command to generate TPC-H tables in ORC format using 
> SNAPPY compression:
> {code:java}
> bin/load-data.py -w tpch -e core --table_formats=orc/snap/block
> {code}
> After it succeeded, I found that the files were still compressed with ZLIB:
> {code:java}
> $ hive --service orcfiledump hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
> Processing data file hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0 [length: 149783256]
> Structure for hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
> File Version: 0.12 with ORC_135
> Rows: 6001215
> Compression: ZLIB         <-------- not SNAPPY
> Compression size: 262144
> Calendar: Julian/Gregorian
> {code}
> The Hive statements we use to generate the data are:
> {code:sql}
> SET hive.exec.compress.output=true;
> SET mapred.output.compression.type=BLOCK;
> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.max.dynamic.partitions=10000;
> SET hive.exec.max.dynamic.partitions.pernode=10000;
> set hive.auto.convert.join=true;
> SET mapred.max.split.size=256000000;
> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
> INSERT OVERWRITE TABLE tpch_orc_snap.lineitem SELECT * FROM tpch.lineitem;
> {code}
> Setting mapred.output.compression.codec has no effect on tables in ORC format.
> Instead, we need to set the table property "orc.compress" to "SNAPPY".
> ref: [https://orc.apache.org/docs/hive-config.html]
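A minimal sketch of the approach described above, not the actual change made to load-data.py: it builds a Hive statement that sets "orc.compress" on the target table. The helper name `orc_compress_stmt` and the codec mapping are hypothetical. In particular, mapping `def` to ZLIB and `none` to NONE is an assumption about the codec abbreviations used in `--table_formats`; only `snap` is taken from the report.

```python
# Hypothetical mapping from table-format codec abbreviations to values
# accepted by the ORC "orc.compress" table property. Only "snap" appears
# in the report; the other entries are assumptions.
ORC_CODECS = {
    "none": "NONE",
    "def": "ZLIB",
    "snap": "SNAPPY",
}


def orc_compress_stmt(table, codec):
    """Return a Hive statement that sets orc.compress on `table`."""
    if codec not in ORC_CODECS:
        raise ValueError("unsupported ORC codec abbreviation: %s" % codec)
    return 'ALTER TABLE {0} SET TBLPROPERTIES ("orc.compress"="{1}");'.format(
        table, ORC_CODECS[codec])


if __name__ == "__main__":
    print(orc_compress_stmt("tpch_orc_snap.lineitem", "snap"))
```

The generated statement would need to run before the INSERT OVERWRITE so that the ORC writer picks up the codec for the files it produces.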



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
