[ https://issues.apache.org/jira/browse/IMPALA-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joe McDonnell resolved IMPALA-10629.
------------------------------------
Fix Version/s: Impala 4.0
Resolution: Fixed
> bin/load-data.py does not respect compression codec for parquet
> ---------------------------------------------------------------
>
> Key: IMPALA-10629
> URL: https://issues.apache.org/jira/browse/IMPALA-10629
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Priority: Major
> Fix For: Impala 4.0
>
>
> If I try to use bin/load-data.py to load TPC-H as ZSTD compressed Parquet, it
> silently ignores the codec and uses Snappy under the covers:
> {noformat}
> $ bin/load-data.py -w tpch --table_formats=parquet/zstd
> $ hdfs dfs -ls /test-warehouse/tpch.lineitem_parquet_zstd/
> Found 4 items
> -rw-r--r-- 3 joe supergroup 72305126 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000000_1779607968_data.0.parq
> -rw-r--r-- 3 joe supergroup 58526717 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000001_53336944_data.0.parq
> -rw-r--r-- 3 joe supergroup 72584796 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
> drwxr-xr-x - joe supergroup 0 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/_impala_insert_staging
> $ hdfs dfs -copyToLocal /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
> $ parquet-reader 02444051906c734d-3b49d6c900000002_53336944_data.0.parq
> ...
> [10] = ColumnChunk {
> 02: file_offset (i64) = 37053592,
> 03: meta_data (struct) = ColumnMetaData {
> 01: type (i32) = 6,
> 02: encodings (list) = list<i32>[2] {
> [0] = 2,
> [1] = 3,
> },
> 03: path_in_schema (list) = list<string>[1] {
> [0] = "l_shipdate",
> },
> 04: codec (i32) = 1, <------ SNAPPY!!!!
> ...{noformat}
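For readers decoding the `codec (i32) = 1` field in the dump above: the integer is a value of the CompressionCodec enum in the Parquet format's Thrift definition, where 1 is SNAPPY and 6 is ZSTD. A minimal Python sketch of that lookup follows. The helper name `codec_name` is hypothetical and exists only for illustration; it is not part of parquet-reader or of bin/load-data.py.

```python
# Codec ids from the CompressionCodec enum in parquet.thrift.
PARQUET_CODECS = {
    0: "UNCOMPRESSED",
    1: "SNAPPY",
    2: "GZIP",
    3: "LZO",
    4: "BROTLI",
    5: "LZ4",
    6: "ZSTD",
    7: "LZ4_RAW",
}


def codec_name(codec_id):
    """Hypothetical helper: map a ColumnMetaData codec id to its enum name."""
    return PARQUET_CODECS.get(codec_id, "UNKNOWN(%d)" % codec_id)


if __name__ == "__main__":
    # The column chunk in the dump reports codec = 1.
    print(codec_name(1))
    # A file actually written with ZSTD would report codec = 6.
    print(codec_name(6))
```

A quick check like this makes the mismatch explicit: the table name says zstd, while the column chunk metadata says SNAPPY.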
> Based on what I'm seeing, bin/load-data.py doesn't set the compression_codec
> query option when loading parquet. It is a bug that this silently does the
> wrong thing, but the actual support is more of a feature request.
> Being able to load ZSTD (or other compression) parquet makes it easier to do
> performance comparisons for those compression codecs on the perf-AB-test
> upstream job ([https://jenkins.impala.io/job/perf-AB-test/]).
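As a rough illustration of the missing step described above, the sketch below derives a `SET COMPRESSION_CODEC=...` statement from a table format string such as `parquet/zstd` and places it before the insert. This is not the code in bin/load-data.py and not the actual fix: the function names, the returned statement list, and the example INSERT statement (including its table names) are all assumptions made for illustration. Only the `COMPRESSION_CODEC` query option itself comes from the report.

```python
def split_table_format(table_format):
    """Hypothetical helper: 'parquet/zstd' -> ('parquet', 'zstd')."""
    file_format, _, codec = table_format.partition("/")
    return file_format.lower(), codec.lower()


def build_load_statements(table_format, insert_stmt):
    """Hypothetical helper: prepend a codec setting for Parquet loads.

    Returns the statements to run, in order. If the table format names no
    codec, no SET statement is emitted and the session default applies.
    """
    file_format, codec = split_table_format(table_format)
    statements = []
    if file_format == "parquet" and codec:
        # Without this statement the insert uses the default codec,
        # which is the behavior the report observed (Snappy).
        statements.append("SET COMPRESSION_CODEC=%s;" % codec.upper())
    statements.append(insert_stmt)
    return statements


if __name__ == "__main__":
    # Illustrative statement only; the table names are assumptions.
    insert = "INSERT OVERWRITE TABLE lineitem_zstd SELECT * FROM lineitem;"
    for stmt in build_load_statements("parquet/zstd", insert):
        print(stmt)
```

The point of the sketch is only the ordering: the codec option has to be set in the same session before the insert that writes the Parquet files.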