Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/15683 )
Change subject: IMPALA-3741 [part 2]: Push runtime bloom filter to Kudu ...................................................................... Patch Set 22: (23 comments) The patch is looking pretty good, thanks for all the work on this! Mostly just some nits and additional testing left. http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/codegen/gen_ir_descriptions.py File be/src/codegen/gen_ir_descriptions.py: http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/codegen/gen_ir_descriptions.py@204 PS22, Line 204: ["RAW_VALUE_GET_HASH_VALUE", I don't think this is being used anywhere anymore, so it can be removed. http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/exec/kudu-scanner.cc File be/src/exec/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/exec/kudu-scanner.cc@247 PS22, Line 247: NULL nit: we prefer nullptr, here and elsewhere http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/kudu/util/block_bloom_filter.h File be/src/kudu/util/block_bloom_filter.h: http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/kudu/util/block_bloom_filter.h@146 PS22, Line 146: // It's possible that log_heap_size is less than kLogBucketWordBits in unit test. We definitely want to avoid making any changes to the be/src/kudu directory if possible, since it makes upgrading the newer versions trickier. Is it possible to rewrite the tests to not trigger this? http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/runtime/raw-value.h File be/src/runtime/raw-value.h: http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/runtime/raw-value.h@86 PS22, Line 86: GetHashValueFastHash32 I think you mean 'GetHashValueFastHash', i.e. this function doesn't return a 32-bit value. http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/runtime/raw-value.h@93 PS22, Line 93: GetHashValueFastHash32NonNull Same here - this doesn't return a 32-bit value, so I think you meant to say GetHashValueFastHashNonNull http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter-test.cc File be/src/util/bloom-filter-test.cc: http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter-test.cc@34 PS22, Line 34: // kudu::BlockBloomFilter::kCpu is a static variable and is initialized once. : // To temporarily disable AVX2 for Bloom Filter in runtime testing, set flag : // disable_blockbloomfilter_avx2 as true. See kudu::BlockBloomFilter::has_avx2(). : // kudu::BlockBloomFilter::has_avx2() is called by kudu::BlockBloomFilter : // constructor and kudu::BlockBloomFilter::OrEqualArrayInternal(). Impala : // testing code should set this flag before creating impala::BloomFilter object : // which affect impala::BloomFilter::Insert() and impala::BloomFilter::Find(), : // or before calling impala::BloomFilter::or(). : // This flag has no effect if the target CPU doesn't support AVX2. This comment is a lot longer than necessary. Probably fine to just say something like "This flag is used in Kudu to temporarily disable avx2 support for testing purposes" http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h File be/src/util/bloom-filter.h: http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@40 PS22, Line 40: class BlockBloomFilter; : class BlockBloomFilterBufferAllocatorIf; : class Slice; nit: I think currently as written none of these are needed, since Slice isn't used here and you're importing block_bloom_filter anyways. http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@70 PS22, Line 70: class ImpalaBloomFilterBufferAllocator : public kudu::BlockBloomFilterBufferAllocatorIf { I think this is now only referenced in bloom-filter.cc, so we can just move the definition there. http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@210 PS22, Line 210: * nit: formatting http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@213 PS22, Line 213: /// Buffer allocator is owned by the bloom filter. This comment seems weird. Maybe say something like "used by kudu to allocate memory for 'block_bloom_filter_' http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@215 PS22, Line 215: e nit: 'E' and please leave an extra line between this and the definition above http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc File be/src/util/bloom-filter.cc: http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@35 PS22, Line 35: #include "util/kudu-status-util.h" nit: please alphabetize http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@49 PS22, Line 49: delete In general, we try to avoid using 'delete' as its error prone. I'm not sure if there's a reason why it needs to be a pointer and not just a be a regular member variable of BloomFilter, but if there is then of course you can use unique_ptr. http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@52 PS22, Line 52: // Kudu only support FastHash for BlockBloomFilter. : // Since Fasthash is strictly better than Murmur Hash2, we do not : // support Murmur Hash2 algorithm for Bloom filter. How about we move this information to the class header on BloomFilter in bloom-filter.h, i.e. when you mention that its a thin wrapper around BlockBloomFilter also mention that we use FAST_HASH. http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@57 PS22, Line 57: hash_seed_ I don't think this is ever actually used, so it can probably just be removed. http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@176 PS22, Line 176: nit: formating http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/hash-util.h File be/src/util/hash-util.h: http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/hash-util.h@30 PS22, Line 30: enum HashAlgorithm { I don't think this is used anywhere, so lets remove it. http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java: http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@722 PS22, Line 722: SlotRef slotRef = targetExpr.unwrapSlotRef(true); This is unnecessary - what its doing is removing any possible casts around the SlotRef, but then you're checking for casts anyways in the 'if' directly below. You can instead do a 'targetExpr instanceof SlotRef', and if true then cast targetExpr to a SlotRef and do the getColumn check. http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@596 PS22, Line 596: public void testKudu() { I think it would be good to add some planner tests for the interaction between HDFS and Kudu scans in the same query, eg. have a bloom filter that gets assigned to both HDFS and Kudu scan nodes, have one that gets assigned to HDFS but not Kudu (eg. because its not targeting a slot ref), etc. http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@635 PS22, Line 635: public void testKuduTpch() { Might be nice to set explainLevel = EXTENDED on at least one of these tests that includes bloom filters just to have it print out the [bloom] or [min-max] next to each filter in the plan. http://gerrit.cloudera.org:8080/#/c/15683/22/testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test File testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test: http://gerrit.cloudera.org:8080/#/c/15683/22/testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test@10 PS22, Line 10: 29200 Is it possible to add PROFILE sections for these first test cases like the ones further down have, with the aggregation over ProbeRows? Might also be nice to think about anything else we can do to make sure these tests are actually doing what we think, eg. maybe also add a row_regex for the "n of n Runtime Filters Published" line to show that we really are generating both min-max and bloom filters. http://gerrit.cloudera.org:8080/#/c/15683/22/testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test@396 PS22, Line 396: # Test case 4: Two StringMinMaxFilters generated in the same fragment (IMPALA-7272) I think this test case can probably be removed (the memory allocation in MinMaxFilter that its testing is covered by the equivalent test in min_max_filter.test already) http://gerrit.cloudera.org:8080/#/c/15683/22/testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test@423 PS22, Line 423: # Test case 5: SEMI_JOIN Add a comment saying what's interesting about this test case? -- To view, visit http://gerrit.cloudera.org:8080/15683 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9100076f68ea299ddb6ec8bc027cac7a47f5d754 Gerrit-Change-Number: 15683 Gerrit-PatchSet: 22 Gerrit-Owner: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Bankim Bhavsar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jim Apple <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Comment-Date: Tue, 02 Jun 2020 22:21:58 +0000 Gerrit-HasComments: Yes
