[jira] [Commented] (IMPALA-11120) load-data.py does not load ORC files with specified codec

ASF subversion and git services (Jira) Tue, 01 Mar 2022 18:02:06 -0800


    [ https://issues.apache.org/jira/browse/IMPALA-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499841#comment-17499841 ]


ASF subversion and git services commented on IMPALA-11120:
----------------------------------------------------------

Commit b2e4b29f06141ad34eef2cbadfda259124792ac2 in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b2e4b29 ]

IMPALA-11120: Fix codec not set in generating ORC tables

We use 'mapred.output.compression.codec' to set the compression codec
when generating test files with Hive. However, it doesn't affect ORC files.
Instead, we need to set 'orc.compress' in tblproperties for each ORC
table. The default value of 'orc.compress' is ZLIB, which corresponds to
our 'def' codec, so we only need to set it for non-def codecs.
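
The rule above can be sketched roughly as follows. This is an illustration,
not the code from the patch: the map and function names are hypothetical, and
only the 'def' -> ZLIB and 'snap' -> SNAPPY pairs are stated in this thread.
The 'none' and 'lz4' entries assume ORC's standard 'orc.compress' values.

```python
# Hypothetical mapping from test-codec names to 'orc.compress' values.
# 'none' -> NONE and 'lz4' -> LZ4 are assumptions based on ORC's
# documented values, not taken from the patch.
ORC_COMPRESS_VALUES = {
    "none": "NONE",
    "def": "ZLIB",
    "snap": "SNAPPY",
    "lz4": "LZ4",
}


def orc_tblproperties_clause(codec):
    """Return a TBLPROPERTIES clause for an ORC table.

    Returns an empty string for 'def', because ZLIB is already the
    default value of 'orc.compress'.
    """
    value = ORC_COMPRESS_VALUES[codec]
    if codec == "def":
        return ""
    return "TBLPROPERTIES ('orc.compress'='%s')" % value


print(orc_tblproperties_clause("snap"))
```

The clause would be appended to the CREATE TABLE statement of each ORC table;
how the real script assembles that statement is not shown in this message.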

This patch also fixes a bug in build_compression_codec_statement() that
would raise KeyError when loading lz4 non-avro tables.
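
The body of build_compression_codec_statement() is not shown in this message,
so the following is only a sketch of the general failure mode and one possible
way to avoid it. The map, the function name, and the choice to emit no
statement for an unmapped codec are all assumptions. The Snappy class name is
the one from the Hive statements quoted in the issue.

```python
# Hypothetical codec map with no 'lz4' entry. Indexing it directly with
# CODEC_CLASSES["lz4"] would raise KeyError.
CODEC_CLASSES = {
    "snap": "org.apache.hadoop.io.compress.SnappyCodec",
}


def codec_statement(codec):
    """Return a Hive SET statement for the codec, or '' if it is unmapped."""
    # dict.get() returns None instead of raising KeyError on a missing key.
    codec_class = CODEC_CLASSES.get(codec)
    if codec_class is None:
        return ""
    return "SET mapred.output.compression.codec=%s;" % codec_class


print(codec_statement("snap"))
print(repr(codec_statement("lz4")))
```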

Tests
 - Loaded tpch data in orc/none/none, orc/def/block, orc/snap/block,
   orc/lz4/block and verified their compression codecs.

Change-Id: I02bd5d9400864145133ff019a3d076a6cab36fcc
Reviewed-on: http://gerrit.cloudera.org:8080/18228
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> load-data.py does not load ORC files with specified codec
> ---------------------------------------------------------
>
>                 Key: IMPALA-11120
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11120
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> I ran the following command to generate TPC-H tables in ORC format using 
> SNAPPY compression:
> {code:java}
> bin/load-data.py -w tpch -e core --table_formats=orc/snap/block
> {code}
> After it succeeded, I realized the compression is still ZLIB:
> {code:java}
> $ hive --service orcfiledump hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
> Processing data file hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0 [length: 149783256]
> Structure for hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
> File Version: 0.12 with ORC_135
> Rows: 6001215
> Compression: ZLIB         <-------- not SNAPPY
> Compression size: 262144
> Calendar: Julian/Gregorian
> {code}
> The Hive statements we use to generate data are
> {code:sql}
> SET hive.exec.compress.output=true;
> SET mapred.output.compression.type=BLOCK;
> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.max.dynamic.partitions=10000;
> SET hive.exec.max.dynamic.partitions.pernode=10000;
> set hive.auto.convert.join=true;
> SET mapred.max.split.size=256000000;
> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
> INSERT OVERWRITE TABLE tpch_orc_snap.lineitem SELECT * FROM tpch.lineitem;
> {code}
> Setting mapred.output.compression.codec has no effect on ORC files. Instead, 
> we need to set the table property "orc.compress" to "SNAPPY".
> ref: [https://orc.apache.org/docs/hive-config.html]
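
The manual check in the issue (reading the "Compression:" line of the
orcfiledump output) could be automated along these lines. This is a sketch
under assumptions: the function name is hypothetical, and the sample text is
an excerpt of the dump quoted above rather than live tool output.

```python
import re


def orc_compression(dump_text):
    """Return the codec named on the 'Compression:' line, or None if absent."""
    # The colon directly after 'Compression' keeps the pattern from
    # matching the separate 'Compression size:' line.
    match = re.search(r"^Compression:\s*(\w+)", dump_text, re.MULTILINE)
    return match.group(1) if match else None


# Excerpt of the orcfiledump output quoted in the issue.
sample = """File Version: 0.12 with ORC_135
Rows: 6001215
Compression: ZLIB
Compression size: 262144
"""

print(orc_compression(sample))
```

Comparing the returned value against the expected codec (for example SNAPPY
for orc/snap/block) would flag the mismatch reported here.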



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

