ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/11100 )
Change subject: Implement BloomFilter Predicate in server side. ...................................................................... Patch Set 8: (66 comments) http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc File src/kudu/common/column_predicate-test.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@45 PS6, Line 45: public: > Can you use SeedRandom() instead? That way we'll use a different seed for e Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@77 PS6, Line 77: Random rand(SeedRandom()); : uint64_t current = 0; > Nit: BloomFilterBuilder* bfb1 Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@79 PS6, Line 79: for (int i = 0; i < 2; ++i) { > Don't need if you use SeedRandom(). Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@83 PS6, Line 83: continue; > Use util/Random instead. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@88 PS6, Line 88: > const uint8_t* Ok, got it, thanks a lot. Done. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@92 PS6, Line 92: } > emplace_back in new code. Elsewhere too. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@791 PS6, Line 791: vector<ColumnPredicate::BloomFilterInner> orig_bloom_filters = *bf; > Don't need std:: prefix. Below too. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@793 PS6, Line 793: // NONE > Nit: maybe name "orig_bloom_filters" to better convey its purpose? Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@828 PS6, Line 828: ColumnPredicate::IsNotNull(column), > bf_copy maybe? Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1382 PS6, Line 1382: BloomFilterBuilder bfb1( > 0 1 both hit Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1414 PS6, Line 1414: vector<Slice> keys_slice; > 0 1 both hit Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1429 PS6, Line 1429: BloomFilterSizing::ByCountAndFPRate(n_keys, 0.01)); > Each of these "Test for ... type" is independent of the other tests, right? Actually it is not independent. This Test is for InBloomFilter merge and the test is formed by three types tests:uint64, string and binary just like the other predicate test, so I think they should be together. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h File src/kudu/common/column_predicate.h: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@275 PS6, Line 275: : BloomFilterInner() : nhash_(0), hash_algorithm_(CITY_HASH) {} : : const Slice& bloom_data() const { : return bloom_data_; : } : : size_t nhash() const { : return nhash_; : } : : HashAlgorithmInBloomFilter hash_algorithm() const { : return hash_algorithm_; : } : : void set_nhash(size_t nhash) { : nhash_ = nhash; : } : : void set_bloom_data(Slice bloom_data) { : bloom_data_ = bloom_data; : } : : v > I don't see the point of these getters and setters since the members are al Emmm I simply try to use setters and getters to let the calls look uniform. Maybe I should change the struct to class. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@301 PS6, Line 301: : bool operator==(const BloomFilterInner& other) const { > Nit: reformat like: Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@304 PS6, Line 304: nhash_ == other.nhash() && > Nit: separate each member from the member before it (or from operator==) wi Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@308 PS6, Line 308: : > I don't think this comment is necessary; the list of supported algorithms i Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@354 PS6, Line 354: into this IS NOT NULL > Nit: const ColumnPredicate& Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@425 PS6, Line 425: > Nit: "The list of bloom filters in this predicate." Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc File src/kudu/common/column_predicate.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@68 PS6, Line 68: : predicate_type_(predicate_type), : column_(move(column)), : lower_(lower), : upper_(upper) { > Nit: use the same indentation as in the other two constructors. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@113 PS6, Line 113: CHECK(bfs != nullptr); > Should check that bfs != nullptr. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@189 PS6, Line 189: upper_ = nullptr; > Hmm, why don't we clear values_ in this function? Why is it necessary to cl Emm, the origin code doesn't clear the values_ and it works well because use the predicate type to judge. I conservativly clear the bloom_filters_ actually if we don't clear is also ok. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@226 PS6, Line 226: SetToNone(); : } : } > Could use SetToNone here. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@254 PS6, Line 254: return; > If we allowed an empty list of bloom filters, could that simplify into some Emmm, I think InBloomFilter predicate must contain as least one bloom filter. And allowed an empty list may not simplify the code. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@264 PS6, Line 264: upper_ = nullptr; : bloom_filters_.clear(); : } else { : SetToNone(); : } : } : } els > Can you explain the rationale behind this simplification? It's not obvious This simplification refer to the Range simplification above. Currently InBloomFilter predicate can contain range, so when we simply InBloomFilter predicate we should consider both. In this case lower_ and upper_ are consecutive which means this predicate may can simplify to Equality if lower_ hits the bloomfilters. If lower_ doesn't hit the bloomfilters this predicate should be None. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@275 PS6, Line 275: predicate_type_ = PredicateType::Equality; > Isn't upper_ guaranteed to be nullptr in this branch? Yes, I just simply copy the Range simplify part. Should I modify the Range part too? Done. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@277 PS6, Line 277: } else { : SetToNone(); : } : } : } else if (upper_ != nullptr) { : if (type_info->IsMinValue(upper_)) { : S > Also not understanding this simplification; could you explain in a comment? This simplification refer to the Range simplification above. In this case lower_ is the max value which means this predicate may can simplify to Equality if lower_ hits the bloomfilters. If lower_ doesn't hit the bloomfilters this predicate should be None. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@287 PS6, Line 287: }; > Why is this case different from the L274-L275 case? That is, why do we use If the lower_ is the min value means the value can be any value, so L274-L275 simply set lower_ to be nullptr and keep the bloomfilters intact. But if the upper_ is the min vlaue means no value can be selected so the predicate can simplify to None. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@443 PS6, Line 443: }; : } : LOG(F > Can you explain this? Equality merge InBloomFilter, if the value(lower_) of Equality can not hit the InBloomFilter(satisfy both bloomfilters and range bound if contains) then the merge result should be None. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@630 PS6, Line 630: bloom_filters_.clear(); > Nit: "bloom filter", not "list" Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@636 PS6, Line 636: > in bloom filter Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@647 PS6, Line 647: } > Nit: vector<const void*> new_values; Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@650 PS6, Line 650: SetToNone(); > Nit: emplace_back in new code. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@870 PS6, Line 870: } > IS IN BloomFilter, perhaps? Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto File src/kudu/common/common.proto: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@357 PS6, Line 357: // Represent a bloom filter. > What does "quick filter the row" mean? Could you elaborate? My fault, the comment is not precise. Done. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@362 PS6, Line 362: optional bytes bloom_data = 2 [(kudu.REDACT) = true]; > Doesn't this depend on the value of hash_algorithm? Yes, done. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@365 PS6, Line 365: > Does this need a default value specifier for CITY_HASH? Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@377 PS6, Line 377: : optional bytes lower = 1 [(kudu. > Nit: not your fault, but could you remove this sentence? We've since added Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@404 PS6, Line 404: // Lower and Upper is optional for InBloomFilter. > Nit: add a comment explaining what this is. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@406 PS6, Line 406: // merge result can be InBloomFilter whith range bound inside. And the lower > How do lower and upper work with bloom filters? Can you doc in a comment? When use both InBloomFilter and Range predicate for the same column the merge result can be InBloomFilter whith range bound inside. And the lower and upper works just like they in Range predicate. Done. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@408 PS6, Line 408: // The inclusive lower bound. > Use field id 2 here and 3 for 'upper'; no reason to skip 2 since there wasn My fault. Done. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/key_util.cc File src/kudu/common/key_util.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/key_util.cc@234 PS6, Line 234: case PredicateType::InBloomFilter: // Upper in InBloomFilter processed as upper in Range. > Nit: "upper in InBloomFilter processed as upper in Range" Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc File src/kudu/common/wire_protocol-test.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@44 PS6, Line 44: #include "kudu/util/test_util.h" > Not sorted correctly. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@508 PS6, Line 508: Arena arena(1024); > WireProtocolTest already has an arena, could reuse it? The above testcase use new local variable arena such as InList, so I want to keep consistent. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@536 PS6, Line 536: NO_FATALS(ColumnPredicateToPB(ibf, &pb)); > NO_FATALS in new code. Below too. Ok, Done. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@555 PS6, Line 555: TEST_F(WireProtocolTest, TestColumnPredicateBloomFilterWithBound) { > The setup code is identical in both tests; can you decompose it into a new The BF test is juxtaposed with InList test above, It is necessary to do this? Or I should decompose the predicate test(include the InList) into a new test fixture? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc File src/kudu/common/wire_protocol.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@416 PS6, Line 416: const void* src = bf_src.bloom_data().data(); : size_t size = bf_src.bloom_data().size(); : bf_dst->mutable_bloom_data()->assign(reinterpret_cast<const char*>(src), size); : bf_dst->set_hash_algorithm(bf_src. > Combine these four lines into two (i.e. combine declaration and assignment) Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@455 PS6, Line 455: ColumnPredicate::BloomFilterInner* dst_src, > from Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@461 PS6, Line 461: memcpy(data_copy, bf_src.bloom_data().data(), bloom_data_size); > bf_src.bloom_data().size() is repeated three times; can you store in a loca Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@463 PS6, Line 463: dst_src->set_hash_algorithm(bf_src.hash_algorithm()); > Why do we need to allocate the slice itself from the arena? Seems like we t Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@511 PS6, Line 511: > Nit: unnecessary space here. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@514 PS6, Line 514: > Nit: ColumnPredicatePB::BloomFilter* bloom_filter Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@595 PS6, Line 595: }; : case Colum > Nit: while you're here, could you fix this indentation? Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@601 PS6, Line 601: m f > Nit: == 0 is more clear Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@603 PS6, Line 603: ter.bloom_filters()) { > Nit: no bloom filters Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@608 PS6, Line 608: bloom_filter; > Nit: "missing bloom filter details" Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@612 PS6, Line 612: // Extract the optional lower and upper bound. > emplace_back Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc File src/kudu/tablet/cfile_set-test.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@21 PS6, Line 21: #include <cstddef> > Nit: no empty line here. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@109 PS6, Line 109: (2n > Nit: "(2n)th" so that it doesn't seem like you're trying to say "second". B Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@110 PS6, Line 110: blo > Nit: form. Below too. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@113 PS6, Line 113: Builder* bf1_contain, > Nit: put this one on its own line for better symmetry with the others. Belo Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@213 PS6, Line 213: const shared_ptr<CFileSet>& > Nit: const shared_ptr<CFileSet>& Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@532 PS6, Line 532: // BloomFilter of column 0 contain and column 1 contain. > Nit: terminate with a period. Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h File src/kudu/util/bloom_filter.h: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@57 PS6, Line 57: explicit BloomKeyProbe(const Slice &key, HashAlgorithmInBloomFilter hash_algorithm = CITY_HASH) > Can you rewrite this constructor to chain to the new one? I chose to use constructor with default value. I can also simply delete the origin constructor and modify the calls of origin constructor. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@68 PS6, Line 68: default: > The explicit keyword is only for constructors with one argument. Done. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@69 PS6, Line 69: h = util_hash::CityHash64( > Reformat like this: Done http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@76 PS6, Line 76: h_2_ = static_cast<uint32_t>(h >> 32); > Can you reformat: Done -- To view, visit http://gerrit.cloudera.org:8080/11100 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62c2de42667d0255d94e19db773240f7f9ee636c Gerrit-Change-Number: 11100 Gerrit-PatchSet: 8 Gerrit-Owner: ZhangYao <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: ZhangYao <[email protected]> Gerrit-Comment-Date: Mon, 24 Sep 2018 09:22:18 +0000 Gerrit-HasComments: Yes
