Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17026 )
Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1 ...................................................................... Patch Set 17: (25 comments) http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG@7 PS17, Line 7: Part 1 Can you add some info about what is expected in later parts? http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG@26 PS17, Line 26: Testing: It would be great to add some unit tests for ParquetBloomFilter, especially if there are paths that are not used in the EE test. http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG@28 PS17, Line 28: Parquet Bloom filtering works for the supported types and that we do Please mention that we use a Parquet file generated by some other tool. This info should be also added to https://github.com/apache/impala/blob/aeeff53e884a67ee7f5980654a1d394c6e3e34ac/testdata/data/README http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.h@529 PS17, Line 529: buffer_pool_client_ BufferPool::ClientHandle should be only used from a single thread: https://github.com/apache/impala/blob/master/be/src/runtime/bufferpool/buffer-pool.h#L332 I think that can cause problems if there are multiple scanners for a single scan node. http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.h@801 PS17, Line 801: static bool IsParquetBloomFilterSupported(parquet::Type::type parquet_type, There may be better places for these functions, e.g. ParquetMetadataUtils, ParquetCommon, or probably a separate file for Parquet bloom filter related stuff. http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.h@704 PS3, Line 704: /// It could be noted that this is read from metadata_range_. http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.h@718 PS3, Line 718: /// Decides how to divide stream_->reservation() between the columns. May increase consistency: EvalDictionaryFilters uses skip_row_group for the same purpose. http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@1108 PS3, Line 1108: nst string& fn_ A few DCHECKs would be nice, e.g. to ensure that metadata_range_ is filled. http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@843 PS17, Line 843: continue; This means that we will skip the row group without raising an y counters. I think that we should process the row group if there are issues with the bloomfilter. http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1470 PS17, Line 1470: FindChildSlotRef Is this needed for any supported types? e.g. char(N) in the example shouldn't be supported. http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1684 PS17, Line 1684: const int8_t* const cast_value = reinterpret_cast<const int8_t*>(value); : const int byte_len = ParquetPlainEncoder::Encode(*cast_value, : -1 /* fixed_len_size */, storage->data()); : DCHECK_EQ(byte_len, output_len); Create a template function for this code? http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1729 PS17, Line 1729: const int exp_size = ParquetPlainEncoder::ByteSize(*cast_value); : storage->resize(exp_size); if this was moved out, then the same template function could be used as the one mentioned at line 1684 http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1825 PS17, Line 1825: __isset.meta_data We shouldn't need to check this here. http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1834 PS17, Line 1834: &header_size, &bloom_filter_header)); nit: too much indentation http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1842 PS17, Line 1842: return Status(Substitute("Could not allocate buffer of $0 bytes for Parquet " nit: too much indentation http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1857 PS17, Line 1857: data_buffer.buffer() + data_already_read, data_to_read)); nit: too much indentation http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exprs/expr-value.h File be/src/exprs/expr-value.h: http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exprs/expr-value.h@209 PS17, Line 209: /// Returns whether the two ExprValue's are equal if they both have the gien type. typo http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exprs/expr-value.h@252 PS17, Line 252: case TYPE_DATETIME: // Not implemented nit: double space http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/kudu/util/block_bloom_filter.cc File be/src/kudu/util/block_bloom_filter.cc: http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/kudu/util/block_bloom_filter.cc@162 PS17, Line 162: Substitute("Mismatch in BlockBloomFilter source directory size $0 and expected size $1", nit: long line http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.h File be/src/util/parquet-bloom-filter.h: http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.h@35 PS17, Line 35: /// Reset the filter state, allocate/reallocate and initialize the 'directory_'. All : /// calls to Insert() and Find() should only be done between the calls to Init() and : /// Close(). Init and Close are safe to call multiple times. : Status Init(const int log_bufferpool_space); Do we ever use this functon? If not, then it could be removed. http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.h@80 PS17, Line 80: /// Returns the expected false positive rate for the given ndv and log_bufferpool_space nit: missing . http://gerrit.cloudera.org:8080/#/c/17026/9/be/src/util/parquet-bloom-filter.h File be/src/util/parquet-bloom-filter.h: http://gerrit.cloudera.org:8080/#/c/17026/9/be/src/util/parquet-bloom-filter.h@35 PS9, Line 35: /// Reset the filter state, allocate/reallocate and initialize the 'directory_'. All : /// calls to Insert() and Find() should only be done between the calls to Init() and : /// Close(). Init and Close are safe to call multiple times. : Status Init(const int log_bufferpool_space); Do we actually call this functions? http://gerrit.cloudera.org:8080/#/c/17026/9/be/src/util/parquet-bloom-filter.cc File be/src/util/parquet-bloom-filter.cc: http://gerrit.cloudera.org:8080/#/c/17026/9/be/src/util/parquet-bloom-filter.cc@52 PS9, Line 52: // We take care of hashing in this class and do not use the underlying Kudu : // BloomFilter's hashing algorithm, so the hash algorithm and seed do not matter. It could be less hacky if kudu::BlockBloomFilter was extended a bit to support "external hashing". E.g. it could have an extra init function and a bool is_external_hashing, and there could be DCHECKs in functions that cannot be called in case of external hashing. http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.cc File be/src/util/parquet-bloom-filter.cc: http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/util/parquet-bloom-filter.cc@36 PS17, Line 36: buffer_allocator_ If we initialize the block bloom filter without copy then the buffer_allocator_ won't be used (if I saw correctly). As we only use the no copy path, I think that buffer_allocator_ and client are not longer needed. http://gerrit.cloudera.org:8080/#/c/17026/17/tests/query_test/test_parquet_bloom_filter.py File tests/query_test/test_parquet_bloom_filter.py: http://gerrit.cloudera.org:8080/#/c/17026/17/tests/query_test/test_parquet_bloom_filter.py@37 PS17, Line 37: test_deprecated_stats should be renamed -- To view, visit http://gerrit.cloudera.org:8080/17026 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287 Gerrit-Change-Number: 17026 Gerrit-PatchSet: 17 Gerrit-Owner: Daniel Becker <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Comment-Date: Thu, 18 Mar 2021 19:45:10 +0000 Gerrit-HasComments: Yes
