Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11100 )
Change subject: Implement BloomFilter Predicate in server side. ...................................................................... Patch Set 2: (6 comments) Interesting feature! I just did a pass over the "big picture." The biggest thing I'm unsure of is why we're pushing bloom filters into range predicates. It seems like they are completely separate predicates, so why combine them with ranges? Also it'd be helpful in reviewing if you could add some comments around some the new classes regarding how they are will be used (e.g. how merging works, etc.), as well as a high level overview of the design in the commit message. http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/column_predicate.h File src/kudu/common/column_predicate.h: http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/column_predicate.h@97 PS2, Line 97: static ColumnPredicate Range(ColumnSchema column, const void* lower, const void* upper, : std::vector<BloomFilterInner>* bfs = nullptr I'm not sure range queries should be able to support BloomFilters. Is InBloomFilter ANDed with a Range not sufficient? If you add InBloomFilter as rank 4 in SelectivityRank() in column_predicate.cc, this should prioritize evaluating bloom filters first, and if it yields no rows, further range predicates should be skipped. Regardless, it'd be helpful to add a comment on how 'bfs' works. http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/common.proto File src/kudu/common/common.proto: http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/common.proto@351 PS2, Line 351: message BloomFilter { One of the things that I think might be confusing in the future is there is this BloomFilter (should probably be renamed to BloomFilterPB), the BloomFilter in utili/bloom_filter.h, and BloomFilterInner. Somewhere in this patch (maybe the commit message), it's worth explaining what each one is and how it is expected to be used. Ex. BloomFilterPB: protobuf that gets serialized and sent via predicates in scans BloomFilterInner: in-memory representation of BloomFilterPB BloomFilter: underlying bloom filter implementation used by the server to check the presence of rows http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/common.proto@352 PS2, Line 352: optional int32 nhash = 1; : // We currently use CityHash backend : optional bytes bloom_data = 2 [(kudu.REDACT) = true]; It's a bit unclear what these fields are. Mind adding comments explaining? Also, since this is a part of the public API, we'd like to make sure this is extendible in the future (e.g. allowing for different kinds of hashes, or different kinds of blooms). As such, maybe define an enum to specify the implementation details. See common.proto for an example. http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/wire_protocol.cc File src/kudu/common/wire_protocol.cc: http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/common/wire_protocol.cc@412 PS2, Line 412: // Copies a predicate bloom filter data from 'bf_src' into 'bf_dst'. : void CopyPredicateBloomDataToPB(const Slice& bf_src, string* bf_dst) { : const void* src; : size_t size; : src = bf_src.data(); : size = bf_src.size(); : bf_dst->assign(reinterpret_cast<const char*>(src), size); : } This would be clearer if it took a BloomFilterInner instead of a Slice. Same with CopyPredicateBloomDataFromPB, but if it took BloomFilterPB as an out-parameter. http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/tablet/cfile_set-test.cc File src/kudu/tablet/cfile_set-test.cc: http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/tablet/cfile_set-test.cc@104 PS2, Line 104: int nrows, BloomFilterBuilder* bf1_contain, : BloomFilterBuilder* bf1_exclude, : BloomFilterBuilder* bf2_contain, : BloomFilterBuilder* bf2_exclude) Can you comment what these variables do and how they should be used? Same with GetBloomFilterResult(). http://gerrit.cloudera.org:8080/#/c/11100/2/src/kudu/tablet/cfile_set-test.cc@451 PS2, Line 451: TEST_F(TestCFileSet, TestBloomFilterPredicates) { Would you mind adding a high-level summary of what this test is testing and how it accomplishes that? What are exclude vs contain, etc.? -- To view, visit http://gerrit.cloudera.org:8080/11100 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62c2de42667d0255d94e19db773240f7f9ee636c Gerrit-Change-Number: 11100 Gerrit-PatchSet: 2 Gerrit-Owner: ZhangYao <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Tue, 07 Aug 2018 23:44:55 +0000 Gerrit-HasComments: Yes
