Csaba Ringhofer has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most 
common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (that is, the number of distinct
               values exceeds the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded
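
The three values can be read as a small decision rule. The sketch below
is illustrative Python, not the Impala implementation; the names
BloomFilterWrite and should_write_bloom_filter are hypothetical and
only restate the descriptions above.

```python
from enum import Enum


class BloomFilterWrite(Enum):
    """Hypothetical mirror of the 'parquet_bloom_filter_write' values."""
    NEVER = "NEVER"
    IF_NO_DICT = "IF_NO_DICT"
    ALWAYS = "ALWAYS"


def should_write_bloom_filter(mode, column_in_tbl_props, fully_dict_encoded):
    """Return True if a Bloom filter would be written for a column.

    mode                -- a BloomFilterWrite value (the query option)
    column_in_tbl_props -- column is listed in 'parquet.bloom.filter.columns'
    fully_dict_encoded  -- the row group is fully dictionary encoded
    """
    # NEVER disables writing; an unlisted column never gets a filter.
    if mode is BloomFilterWrite.NEVER or not column_in_tbl_props:
        return False
    # ALWAYS writes for listed columns regardless of the encoding.
    if mode is BloomFilterWrite.ALWAYS:
        return True
    # IF_NO_DICT writes only when dictionary encoding was abandoned.
    return not fully_dict_encoded
```

In this reading the table property acts as a per-column opt-in for
both IF_NO_DICT and ALWAYS, while the query option decides whether
dictionary encoding suppresses the filter.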

The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The 'bytes' part is the size of the
Bloom filter's bitset and is optional. If the size is not given, the
maximal Bloom filter size (ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.
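
To make the format concrete, here is a minimal parsing sketch in
Python. It is not the parser added by this change; the function name
parse_bloom_filter_columns is hypothetical, and a missing size is
represented as None, standing in for "use
ParquetBloomFilter::MAX_BYTES". Validation that the real parser may
perform (for example, limits on the size) is not modelled.

```python
from typing import Dict, Optional


def parse_bloom_filter_columns(prop: str) -> Dict[str, Optional[int]]:
    """Parse a 'col_name:bytes' list into {column: size_or_None}.

    None means no size was given for that column.
    """
    result: Dict[str, Optional[int]] = {}
    for entry in prop.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty items such as a trailing comma
        # partition() splits on the first ':'; sep is '' if there is none.
        name, sep, size = entry.partition(":")
        name = name.strip()
        if not name:
            raise ValueError("missing column name in %r" % entry)
        # int() raises ValueError on a non-numeric or empty size.
        result[name] = int(size) if sep else None
    return result
```

Applied to the example value above, the sketch maps col1 to 1024,
col2 to None (no size given), and col4 to 100.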

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether
    the Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Reviewed-on: http://gerrit.cloudera.org:8080/17262
Reviewed-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Csaba Ringhofer <[email protected]>
Tested-by: Csaba Ringhofer <[email protected]>
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 997 insertions(+), 82 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved
  Csaba Ringhofer: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 27
Gerrit-Owner: Daniel Becker <[email protected]>
Gerrit-Reviewer: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
