Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/17026 )
Change subject: IMPALA-10640: Support reading Parquet Bloom filters - most common types ...................................................................... Patch Set 25: (9 comments) http://gerrit.cloudera.org:8080/#/c/17026/24//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17026/24//COMMIT_MSG@7 PS24, Line 7: IMPALA-10640: Support reading Parquet Bloom filters - most common types > Do we filter fields in complex types as well, e.g. elements of an array? Or I think it can only apply to top level columns as I check whether the type of the expression is among the supported types for Parquet Bloom filtering (see them below) and if the type is a complex type, it is not included. http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@887 PS24, Line 887: > We have query options to disable row group/page index filtering. Probably w I've added the query option and put this code and the call to CreateColIdx2EqConjunctMap() inside an IF statement depending on the query option. Do you think we should also add a test case in tests/query_test/test_parquet_bloom_filter.py that checks there is no Bloom filtering if the query option is 'false'? http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1483 PS24, Line 1483: > nit: thanks for adding this member function. Can we use it at ParquetPageIn I've considered that but ParquetPageIndex::ReadAll() uses a different overload of scan_node_->AllocateScanRange(), which also involves subranges, so it would be more complicated. http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1497 PS24, Line 1497: fset, partition_ > nit: fits previous line Done http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1502 PS24, Line 1502: n > nit: +2 indent Done http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1716 PS24, Line 1716: bloom_filter.InitFromDirectoryNoCopy(data_buffer.buffer(), data_buffer.Size())); > nit: could you please add some logging at VLOG(3)? E.g. which conjunct filt I included the row group idx, column idx and filename in the log message. If you think it's important, I can add info about the specific conjunct, but now I only store the hash of the value of the literal in the EQ conjunct, so at this point in the code we cannot identify the conjunct. If you think it is needed, I can store more info about the conjuncts. http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/parquet-bloom-filter-util.h File be/src/exec/parquet/parquet-bloom-filter-util.h: http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/parquet-bloom-filter-util.h@37 PS24, Line 37: . > nit: could you add some details please? When storage is used, and when it i I do not want to guarantee in which cases '*storage' will be used as the implementation may change. I have updated the doc comment to hopefully make it clearer how we use it. Essentially, the result should always be read through '*ptr', and iff allocation is needed to store the result, '*storage' will be used to allocate a buffer and '*ptr' will point to that buffer. http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/kudu/util/block_bloom_filter.cc File be/src/kudu/util/block_bloom_filter.cc: http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/kudu/util/block_bloom_filter.cc@135 PS24, Line 135: us > nit: too much indentation Done http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/thirdparty/xxhash/xxhash.h File be/src/thirdparty/xxhash/xxhash.h: http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/thirdparty/xxhash/xxhash.h@70 PS24, Line 70: https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735 > Probably you could add extend EXCLUDE_FILE_PATTERNS to thirdparty files in Done -- To view, visit http://gerrit.cloudera.org:8080/17026 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287 Gerrit-Change-Number: 17026 Gerrit-PatchSet: 25 Gerrit-Owner: Daniel Becker <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tamas Mate <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 08 Apr 2021 15:57:27 +0000 Gerrit-HasComments: Yes
