Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11100 )

Change subject: Implement BloomFilter Predicate in server side.
......................................................................


Patch Set 2:

(6 comments)

Interesting feature! I just did a pass over the "big picture." The biggest 
thing I'm unsure of is why we're pushing bloom filters into range predicates. 
It seems like they are completely separate predicates, so why combine them with 
ranges?

Also it'd be helpful in reviewing if you could add some comments around some 
the new classes regarding how they are will be used (e.g. how merging works, 
etc.), as well as a high level overview of the design in the commit message.

http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/column_predicate.h
File src/kudu/common/column_predicate.h:

http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/column_predicate.h@97
PS2, Line 97:   static ColumnPredicate Range(ColumnSchema column, const void* 
lower, const void* upper,
            :                                std::vector<BloomFilterInner>* bfs 
= nullptr
I'm not sure range queries should be able to support BloomFilters. Is 
InBloomFilter ANDed with a Range not sufficient? If you add InBloomFilter as 
rank 4 in SelectivityRank() in column_predicate.cc, this should prioritize 
evaluating bloom filters first, and if it yields no rows, further range 
predicates should be skipped.

Regardless, it'd be helpful to add a comment on how 'bfs' works.


http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/common.proto@351
PS2, Line 351: message BloomFilter {
One of the things that I think might be confusing in the future is there is 
this BloomFilter (should probably be renamed to BloomFilterPB), the BloomFilter 
in utili/bloom_filter.h, and BloomFilterInner. Somewhere in this patch (maybe 
the commit message), it's worth explaining what each one is and how it is 
expected to be used. Ex.

BloomFilterPB: protobuf that gets serialized and sent via predicates in scans
BloomFilterInner: in-memory representation of BloomFilterPB
BloomFilter: underlying bloom filter implementation used by the server to check 
the presence of rows


http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/common.proto@352
PS2, Line 352: optional int32 nhash = 1;
             :     // We currently use CityHash backend
             :     optional bytes bloom_data = 2 [(kudu.REDACT) = true];
It's a bit unclear what these fields are. Mind adding comments explaining?

Also, since this is a part of the public API, we'd like to make sure this is 
extendible in the future (e.g. allowing for different kinds of hashes, or 
different kinds of blooms). As such, maybe define an enum to specify the 
implementation details. See common.proto for an example.


http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/wire_protocol.cc
File src/kudu/common/wire_protocol.cc:

http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/wire_protocol.cc@412
PS2, Line 412: // Copies a predicate bloom filter data from 'bf_src' into 
'bf_dst'.
             : void CopyPredicateBloomDataToPB(const Slice& bf_src, string* 
bf_dst) {
             :   const void* src;
             :   size_t size;
             :   src = bf_src.data();
             :   size = bf_src.size();
             :   bf_dst->assign(reinterpret_cast<const char*>(src), size);
             : }
This would be clearer if it took a BloomFilterInner instead of a Slice. Same 
with CopyPredicateBloomDataFromPB, but if it took BloomFilterPB as an 
out-parameter.


http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/tablet/cfile_set-test.cc
File src/kudu/tablet/cfile_set-test.cc:

http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/tablet/cfile_set-test.cc@104
PS2, Line 104: int nrows, BloomFilterBuilder* bf1_contain,
             :                        BloomFilterBuilder* bf1_exclude,
             :                        BloomFilterBuilder* bf2_contain,
             :                        BloomFilterBuilder* bf2_exclude)
Can you comment what these variables do and how they should be used? Same with 
GetBloomFilterResult().


http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/tablet/cfile_set-test.cc@451
PS2, Line 451: TEST_F(TestCFileSet, TestBloomFilterPredicates) {
Would you mind adding a high-level summary of what this test is testing and how 
it accomplishes that? What are exclude vs contain, etc.?



--
To view, visit http://gerrit.cloudera.org:8080/11100
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62c2de42667d0255d94e19db773240f7f9ee636c
Gerrit-Change-Number: 11100
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Comment-Date: Tue, 07 Aug 2018 23:44:55 +0000
Gerrit-HasComments: Yes

Reply via email to