Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15683 )

Change subject: IMPALA-3741 [part 2]: Push runtime bloom filter to Kudu
......................................................................


Patch Set 22:

(23 comments)

The patch is looking pretty good, thanks for all the work on this!

Mostly just some nits and additional testing left.

http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/codegen/gen_ir_descriptions.py
File be/src/codegen/gen_ir_descriptions.py:

http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/codegen/gen_ir_descriptions.py@204
PS22, Line 204: ["RAW_VALUE_GET_HASH_VALUE",
I don't think this is being used anywhere anymore, so it can be removed.


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/exec/kudu-scanner.cc@247
PS22, Line 247: NULL
nit: we prefer nullptr, here and elsewhere


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/kudu/util/block_bloom_filter.h
File be/src/kudu/util/block_bloom_filter.h:

http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/kudu/util/block_bloom_filter.h@146
PS22, Line 146:     // It's possible that log_heap_size is less than 
kLogBucketWordBits in unit test.
We definitely want to avoid making any changes to the be/src/kudu directory if 
possible, since it makes upgrading the newer versions trickier. Is it possible 
to rewrite the tests to not trigger this?


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/runtime/raw-value.h
File be/src/runtime/raw-value.h:

http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/runtime/raw-value.h@86
PS22, Line 86: GetHashValueFastHash32
I think you mean 'GetHashValueFastHash', i.e. this function doesn't return a 
32-bit value.


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/runtime/raw-value.h@93
PS22, Line 93: GetHashValueFastHash32NonNull
Same here - this doesn't return a 32-bit value, so I think you meant to say 
GetHashValueFastHashNonNull


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter-test.cc
File be/src/util/bloom-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter-test.cc@34
PS22, Line 34: // kudu::BlockBloomFilter::kCpu is a static variable and is 
initialized once.
             : // To temporarily disable AVX2 for Bloom Filter in runtime 
testing, set flag
             : // disable_blockbloomfilter_avx2 as true. See 
kudu::BlockBloomFilter::has_avx2().
             : // kudu::BlockBloomFilter::has_avx2() is called by 
kudu::BlockBloomFilter
             : // constructor and 
kudu::BlockBloomFilter::OrEqualArrayInternal(). Impala
             : // testing code should set this flag before creating 
impala::BloomFilter object
             : // which affect impala::BloomFilter::Insert() and 
impala::BloomFilter::Find(),
             : // or before calling impala::BloomFilter::or().
             : // This flag has no effect if the target CPU doesn't support 
AVX2.
This comment is a lot longer than necessary. Probably fine to just say 
something like "This flag is used in Kudu to temporarily disable avx2 support 
for testing purposes"


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@40
PS22, Line 40: class BlockBloomFilter;
             : class BlockBloomFilterBufferAllocatorIf;
             : class Slice;
nit: I think currently as written none of these are needed, since Slice isn't 
used here and you're importing block_bloom_filter anyways.


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@70
PS22, Line 70: class ImpalaBloomFilterBufferAllocator : public 
kudu::BlockBloomFilterBufferAllocatorIf {
I think this is now only referenced in bloom-filter.cc, so we can just move the 
definition there.


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@210
PS22, Line 210:  *
nit: formatting


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@213
PS22, Line 213: /// Buffer allocator is owned by the bloom filter.
This comment seems weird. Maybe say something like "used by kudu to allocate 
memory for 'block_bloom_filter_'


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.h@215
PS22, Line 215: e
nit: 'E'

and please leave an extra line between this and the definition above


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc
File be/src/util/bloom-filter.cc:

http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@35
PS22, Line 35: #include "util/kudu-status-util.h"
nit: please alphabetize


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@49
PS22, Line 49: delete
In general, we try to avoid using 'delete' as its error prone.

I'm not sure if there's a reason why it needs to be a pointer and not just a be 
a regular member variable of BloomFilter, but if there is then of course you 
can use unique_ptr.


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@52
PS22, Line 52: // Kudu only support FastHash for BlockBloomFilter.
             : // Since Fasthash is strictly better than Murmur Hash2, we do not
             : // support Murmur Hash2 algorithm for Bloom filter.
How about we move this information to the class header on BloomFilter in 
bloom-filter.h, i.e. when you mention that its a thin wrapper around 
BlockBloomFilter also mention that we use FAST_HASH.


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@57
PS22, Line 57: hash_seed_
I don't think this is ever actually used, so it can probably just be removed.


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/bloom-filter.cc@176
PS22, Line 176:
nit: formating


http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/hash-util.h
File be/src/util/hash-util.h:

http://gerrit.cloudera.org:8080/#/c/15683/22/be/src/util/hash-util.h@30
PS22, Line 30: enum HashAlgorithm {
I don't think this is used anywhere, so lets remove it.


http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@722
PS22, Line 722: SlotRef slotRef = targetExpr.unwrapSlotRef(true);
This is unnecessary - what its doing is removing any possible casts around the 
SlotRef, but then you're checking for casts anyways in the 'if' directly below.

You can instead do a 'targetExpr instanceof SlotRef', and if true then cast 
targetExpr to a SlotRef and do the getColumn check.


http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@596
PS22, Line 596:   public void testKudu() {
I think it would be good to add some planner tests for the interaction between 
HDFS and Kudu scans in the same query, eg. have a bloom filter that gets 
assigned to both HDFS and Kudu scan nodes, have one that gets assigned to HDFS 
but not Kudu (eg. because its not targeting a slot ref), etc.


http://gerrit.cloudera.org:8080/#/c/15683/22/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@635
PS22, Line 635:   public void testKuduTpch() {
Might be nice to set explainLevel = EXTENDED on at least one of these tests 
that includes bloom filters just to have it print out the [bloom] or [min-max] 
next to each filter in the plan.


http://gerrit.cloudera.org:8080/#/c/15683/22/testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test
File 
testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test:

http://gerrit.cloudera.org:8080/#/c/15683/22/testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test@10
PS22, Line 10: 29200
Is it possible to add PROFILE sections for these first test cases like the ones 
further down have, with the aggregation over ProbeRows?

Might also be nice to think about anything else we can do to make sure these 
tests are actually doing what we think, eg. maybe also add a row_regex for the 
"n of n Runtime Filters Published" line to show that we really are generating 
both min-max and bloom filters.


http://gerrit.cloudera.org:8080/#/c/15683/22/testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test@396
PS22, Line 396: # Test case 4: Two StringMinMaxFilters generated in the same 
fragment (IMPALA-7272)
I think this test case can probably be removed (the memory allocation in 
MinMaxFilter that its testing is covered by the equivalent test in 
min_max_filter.test already)


http://gerrit.cloudera.org:8080/#/c/15683/22/testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test@423
PS22, Line 423: # Test case 5: SEMI_JOIN
Add a comment saying what's interesting about this test case?



--
To view, visit http://gerrit.cloudera.org:8080/15683
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9100076f68ea299ddb6ec8bc027cac7a47f5d754
Gerrit-Change-Number: 15683
Gerrit-PatchSet: 22
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Bankim Bhavsar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Tue, 02 Jun 2020 22:21:58 +0000
Gerrit-HasComments: Yes

Reply via email to