Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-10640: Support reading Parquet Bloom filters - most 
common types
......................................................................


Patch Set 25:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/17026/24//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17026/24//COMMIT_MSG@7
PS24, Line 7: IMPALA-10640: Support reading Parquet Bloom filters - most common 
types
> Do we filter fields in complex types as well, e.g. elements of an array? Or
I think it can only apply to top level columns as I check whether the type of 
the expression is among the supported types for Parquet Bloom filtering (see 
them below) and if the type is a complex type, it is not included.


http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@887
PS24, Line 887:
> We have query options to disable row group/page index filtering. Probably w
I've added the query option and put this code and the call to 
CreateColIdx2EqConjunctMap() inside an IF statement depending on the query 
option.
Do you think we should also add a test case in 
tests/query_test/test_parquet_bloom_filter.py that checks there is no Bloom 
filtering if the query option is 'false'?


http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1483
PS24, Line 1483:
> nit: thanks for adding this member function. Can we use it at ParquetPageIn
I've considered that but ParquetPageIndex::ReadAll()
 uses a different overload of scan_node_->AllocateScanRange(), which also 
involves subranges, so it would be more complicated.


http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1497
PS24, Line 1497: fset, partition_
> nit: fits previous line
Done


http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1502
PS24, Line 1502:  n
> nit: +2 indent
Done


http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1716
PS24, Line 1716:           
bloom_filter.InitFromDirectoryNoCopy(data_buffer.buffer(), data_buffer.Size()));
> nit: could you please add some logging at VLOG(3)? E.g. which conjunct filt
I included the row group idx, column idx and filename in the log message. If 
you think it's important, I can add info about the specific conjunct, but now I 
only store the hash of the value of the literal in the EQ conjunct, so at this 
point in the code we cannot identify the conjunct. If you think it is needed, I 
can store more info about the conjuncts.


http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/parquet-bloom-filter-util.h
File be/src/exec/parquet/parquet-bloom-filter-util.h:

http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/parquet-bloom-filter-util.h@37
PS24, Line 37: .
> nit: could you add some details please? When storage is used, and when it i
I do not want to guarantee in which cases '*storage' will be used as the 
implementation may change. I have updated the doc comment to hopefully make it 
clearer how we use it. Essentially, the result should always be read through 
'*ptr', and iff allocation is needed to store the result, '*storage' will be 
used to allocate a buffer and '*ptr' will point to that buffer.


http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/kudu/util/block_bloom_filter.cc
File be/src/kudu/util/block_bloom_filter.cc:

http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/kudu/util/block_bloom_filter.cc@135
PS24, Line 135: us
> nit: too much indentation
Done


http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/thirdparty/xxhash/xxhash.h@70
PS24, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
> Probably you could add extend EXCLUDE_FILE_PATTERNS to thirdparty files in
Done



--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 25
Gerrit-Owner: Daniel Becker <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 08 Apr 2021 15:57:27 +0000
Gerrit-HasComments: Yes

Reply via email to