Repository: impala
Updated Branches:
  refs/heads/2.x 329979d6f -> 07c704aef

IMPALA-7304: Don't write floating column index until PARQUET-1222 is resolved.

Impala master branch can already write the Parquet page index. However,
we still don't have a well-defined ordering for floating-point numbers
in Parquet, see PARQUET-1222.

Currently Impala writes the page index with fmax()/fmin() semantics,
but this might contradict the future semantics that will be defined
once PARQUET-1222 is resolved.

From this patch on, Impala won't write the column index for
floating-point columns until PARQUET-1222 is resolved and implemented.

I updated the Python test accordingly.

Change-Id: I50aa2e6607de6a8943eb068b8162b0506763078b
Reviewed-on: http://gerrit.cloudera.org:8080/10951
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
(cherry picked from commit 041197444d2a73bc3e3da4c6dbfdf1d63c236fbf)
Reviewed-on: http://gerrit.cloudera.org:8080/10960
Reviewed-by: Zoltan Borok-Nagy <borokna...@cloudera.com>
Tested-by: Zoltan Borok-Nagy <borokna...@cloudera.com>

Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/07c704ae
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/07c704ae
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/07c704ae

Branch: refs/heads/2.x
Commit: 07c704aef3e8806198334bbf2f530293d717813f
Parents: 329979d
Author: Zoltan Borok-Nagy <borokna...@cloudera.com>
Authored: Mon Jul 16 14:24:45 2018 +0200
Committer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Committed: Wed Jul 18 10:31:47 2018 +0000

----------------------------------------------------------------------
 be/src/exec/hdfs-parquet-table-writer.cc    | 6 ++++++
 tests/query_test/test_parquet_page_index.py | 5 +++++
 2 files changed, 11 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/07c704ae/be/src/exec/hdfs-parquet-table-writer.cc
----------------------------------------------------------------------
diff --git a/be/src/exec/hdfs-parquet-table-writer.cc b/be/src/exec/hdfs-parquet-table-writer.cc
index 91a2084..8aa4f7a 100644
--- a/be/src/exec/hdfs-parquet-table-writer.cc
+++ b/be/src/exec/hdfs-parquet-table-writer.cc
@@ -338,10 +338,16 @@ class HdfsParquetTableWriter::ColumnWriter :
       plain_encoded_value_size_(
           ParquetPlainEncoder::EncodedByteSize(eval->root().type())) {
     DCHECK_NE(eval->root().type().type, TYPE_BOOLEAN);
+    // IMPALA-7304: Don't write column index for floating-point columns until
+    // PARQUET-1222 is resolved.
+    if (std::is_floating_point<T>::value) valid_column_index_ = false;
   }
 
   virtual void Reset() {
     BaseColumnWriter::Reset();
+    // IMPALA-7304: Don't write column index for floating-point columns until
+    // PARQUET-1222 is resolved.
+    if (std::is_floating_point<T>::value) valid_column_index_ = false;
     // Default to dictionary encoding. If the cardinality ends up being too high,
     // it will fall back to plain.
     current_encoding_ = parquet::Encoding::PLAIN_DICTIONARY;

http://git-wip-us.apache.org/repos/asf/impala/blob/07c704ae/tests/query_test/test_parquet_page_index.py
----------------------------------------------------------------------
diff --git a/tests/query_test/test_parquet_page_index.py b/tests/query_test/test_parquet_page_index.py
index 0ee5d37..6235819 100644
--- a/tests/query_test/test_parquet_page_index.py
+++ b/tests/query_test/test_parquet_page_index.py
@@ -226,6 +226,11 @@ class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
       index_size = len(column_info.offset_index.page_locations)
       assert index_size > 0
       self._validate_page_locations(column_info.offset_index.page_locations)
+      # IMPALA-7304: Impala doesn't write column index for floating-point columns
+      # until PARQUET-1222 is resolved.
+      if column_info.schema.type in [4, 5]:
+        assert column_info.column_index is None
+        continue
       self._validate_null_stats(index_size, column_info)
       self._validate_min_max_values(index_size, column_info)
       self._validate_boundary_order(column_info)
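
For readers unfamiliar with the fmax()/fmin() concern raised in the commit message, the following is a minimal, standalone C++ sketch and not Impala code. It illustrates two things under stated assumptions: that folding values with std::fmin() silently skips NaN (so a page minimum computed this way says nothing about NaN values in the page), and that std::is_floating_point can gate behavior per column value type, as the patch does. The names PageMinWithFmin and WritesColumnIndex are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of Impala): computes a page minimum by
// folding the values with std::fmin(). When exactly one argument of
// std::fmin() is NaN, the other argument is returned, so NaN values are
// dropped from the result unless every value is NaN.
inline double PageMinWithFmin(const std::vector<double>& values) {
  double result = std::numeric_limits<double>::quiet_NaN();
  for (double v : values) result = std::fmin(result, v);
  return result;
}

// Hypothetical predicate (not part of Impala): mirrors the type test used in
// the patch. Floating-point value types are excluded from column index
// writing; all other types are left unaffected by this check.
template <typename T>
constexpr bool WritesColumnIndex() {
  return !std::is_floating_point<T>::value;
}
```

With these definitions, PageMinWithFmin({3.0, NaN, 1.0}) yields 1.0 even though the page contains a NaN, which is the kind of ordering question the commit message defers to PARQUET-1222. WritesColumnIndex<float>() and WritesColumnIndex<double>() are false, while integer types such as int32_t yield true.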