Bikramjeet Vig has uploaded a new patch set (#3). Change subject: IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types ......................................................................
IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types Added support for reading Parquet decimal Types encoded in : 1) Int32 2) Int64 3) Binary (Variable size byte array) A few points to look at: 1) the test table I generated to read INT64 encoded decimals shows null values for the whole row if a decimal column exists with precision less than or eqaul to 9 and encoded in Int64. If I alter table and change it to 10 or above, it displays the data correctly. One can infer that there is some issue with impala doing some internal conversion after reading it from the parquet reader, since a precision less than 10 should be stored using a Decimal4 (stores value in int32) but it gets a Decimal8 from the parquet reader. Any idea on what might be going wrong? 2) Currently the test for ParquetPlainEncoder contains templatized test methods that encode and decode a passed value. Since we had encode and decode methods for all types that we supported in both writer and reader, this was fine. now to support the types added in this commit, where only the reader supports it (hence only decode method exists), I would either have to write a completely new test method or specialize the existing test method for every permutation of <Decimal, Parquet::type>. In any case a lot of new code will be added, which option do you think is better? 3) Waiting on confirmation that column stats should ne stored with the exact same format (same <Converted Type, Primitive Type> as the column). If we do get a confirmation that this is the correct. I would have to pass a SchemaElement object of that column to ColumnStatsBase::ReadFromThrift for handeling the case where the type is <Decimal, FIXED_LEN_BYTE_ARRAY>. We would be able to use the type_length attribute to use the correctly sized DecimalXValue in DecodePlainValue for reading the stat. 4) Will perf regression test be good enough or do you think we need to do perf tests for the newly added types too? Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e --- M be/src/exec/data-source-scan-node.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exec/parquet-plain-test.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test M tests/query_test/test_scanners.py 17 files changed, 550 insertions(+), 244 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/3 -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bikramjeet Vig <[email protected]> Gerrit-Reviewer: Bikramjeet Vig <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]>
