[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

Csaba Ringhofer (Code Review) Thu, 18 Mar 2021 12:45:23 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )


Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
......................................................................


Patch Set 17:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG@7
PS17, Line 7: Part 1
Can you add some info about what is expected in later parts?


http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG@26
PS17, Line 26: Testing:
It would be great to add some unit tests for ParquetBloomFilter, especially if 
there are paths that are not used in the EE test.


http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG@28
PS17, Line 28:     Parquet Bloom filtering works for the supported types and 
that we do
Please mention that we use a Parquet file generated by some other tool. This 
info should be also added to 
https://github.com/apache/impala/blob/aeeff53e884a67ee7f5980654a1d394c6e3e34ac/testdata/data/README


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.h@529
PS17, Line 529: buffer_pool_client_
BufferPool::ClientHandle should be only used from a single thread:
https://github.com/apache/impala/blob/master/be/src/runtime/bufferpool/buffer-pool.h#L332

I think that can cause problems if there are multiple scanners for a single 
scan node.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.h@801
PS17, Line 801:   static bool IsParquetBloomFilterSupported(parquet::Type::type 
parquet_type,
There may be better places for these functions, e.g. ParquetMetadataUtils, 
ParquetCommon, or probably a separate file for Parquet bloom filter related 
stuff.


http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.h@704
PS3, Line 704:   ///
It could be noted that this is read from metadata_range_.


http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.h@718
PS3, Line 718:   /// Decides how to divide stream_->reservation() between the 
columns. May increase
consistency: EvalDictionaryFilters uses skip_row_group for the same purpose.


http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@1108
PS3, Line 1108: nst string& fn_
A few DCHECKs would be nice, e.g. to ensure that metadata_range_ is filled.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@843
PS17, Line 843:       continue;
This means that we will skip the row group without raising an y counters. I 
think that we should process the row group if there are issues with the 
bloomfilter.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1470
PS17, Line 1470: FindChildSlotRef
Is this needed for any supported types? e.g. char(N) in the example shouldn't 
be supported.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1684
PS17, Line 1684:       const int8_t* const cast_value = reinterpret_cast<const 
int8_t*>(value);
               :       const int byte_len = 
ParquetPlainEncoder::Encode(*cast_value,
               :           -1 /* fixed_len_size */, storage->data());
               :       DCHECK_EQ(byte_len, output_len);
Create a template function for this code?


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1729
PS17, Line 1729:     const int exp_size = 
ParquetPlainEncoder::ByteSize(*cast_value);
               :     storage->resize(exp_size);
if this was moved out, then the same template function could be used as the one 
mentioned at line 1684


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1825
PS17, Line 1825: __isset.meta_data
We shouldn't need to check this here.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1834
PS17, Line 1834:             &header_size, &bloom_filter_header));
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1842
PS17, Line 1842:           return Status(Substitute("Could not allocate buffer 
of $0 bytes for Parquet "
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1857
PS17, Line 1857:             data_buffer.buffer() + data_already_read, 
data_to_read));
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exprs/expr-value.h
File be/src/exprs/expr-value.h:

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exprs/expr-value.h@209
PS17, Line 209:   /// Returns whether the two ExprValue's are equal if they 
both have the gien type.
typo


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exprs/expr-value.h@252
PS17, Line 252:       case  TYPE_DATETIME:    // Not implemented
nit: double space


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/kudu/util/block_bloom_filter.cc
File be/src/kudu/util/block_bloom_filter.cc:

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/kudu/util/block_bloom_filter.cc@162
PS17, Line 162:         Substitute("Mismatch in BlockBloomFilter source 
directory size $0 and expected size $1",
nit: long line


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.h
File be/src/util/parquet-bloom-filter.h:

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.h@35
PS17, Line 35:   /// Reset the filter state, allocate/reallocate and initialize 
the 'directory_'. All
             :   /// calls to Insert() and Find() should only be done between 
the calls to Init() and
             :   /// Close(). Init and Close are safe to call multiple times.
             :   Status Init(const int log_bufferpool_space);
Do we ever use this functon? If not, then it could be removed.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.h@80
PS17, Line 80:   /// Returns the expected false positive rate for the given ndv 
and log_bufferpool_space
nit: missing .


http://gerrit.cloudera.org:8080/#/c/17026/9/be/src/util/parquet-bloom-filter.h
File be/src/util/parquet-bloom-filter.h:

http://gerrit.cloudera.org:8080/#/c/17026/9/be/src/util/parquet-bloom-filter.h@35
PS9, Line 35:   /// Reset the filter state, allocate/reallocate and initialize 
the 'directory_'. All
            :   /// calls to Insert() and Find() should only be done between 
the calls to Init() and
            :   /// Close(). Init and Close are safe to call multiple times.
            :   Status Init(const int log_bufferpool_space);
Do we actually call this functions?


http://gerrit.cloudera.org:8080/#/c/17026/9/be/src/util/parquet-bloom-filter.cc
File be/src/util/parquet-bloom-filter.cc:

http://gerrit.cloudera.org:8080/#/c/17026/9/be/src/util/parquet-bloom-filter.cc@52
PS9, Line 52:   // We take care of hashing in this class and do not use the 
underlying Kudu
            :   // BloomFilter's hashing algorithm, so the hash algorithm and 
seed do not matter.
It could be less hacky if kudu::BlockBloomFilter was extended a bit to support 
"external hashing". E.g. it could have an extra init function and a bool 
is_external_hashing, and there could be DCHECKs in functions that cannot be 
called in case of external hashing.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.cc
File be/src/util/parquet-bloom-filter.cc:

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.cc@36
PS17, Line 36: buffer_allocator_
If we initialize the block bloom filter without copy then the buffer_allocator_ 
won't be used (if I saw correctly). As we only use the no copy path, I think 
that buffer_allocator_ and client are not longer needed.


http://gerrit.cloudera.org:8080/#/c/17026/17/tests/query_test/test_parquet_bloom_filter.py
File tests/query_test/test_parquet_bloom_filter.py:

http://gerrit.cloudera.org:8080/#/c/17026/17/tests/query_test/test_parquet_bloom_filter.py@37
PS17, Line 37: test_deprecated_stats
should be renamed



--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 17
Gerrit-Owner: Daniel Becker <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Comment-Date: Thu, 18 Mar 2021 19:45:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

Reply via email to