ZhangYao has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11100 )

Change subject: Implement BloomFilter Predicate in server side.
......................................................................


Patch Set 8:

(66 comments)

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc
File src/kudu/common/column_predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@45
PS6, Line 45:  public:
> Can you use SeedRandom() instead? That way we'll use a different seed for e
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@77
PS6, Line 77:     Random rand(SeedRandom());
            :     uint64_t current = 0;
> Nit: BloomFilterBuilder* bfb1
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@79
PS6, Line 79:     for (int i = 0; i < 2; ++i) {
> Don't need if you use SeedRandom().
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@83
PS6, Line 83:           continue;
> Use util/Random instead.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@88
PS6, Line 88:
> const uint8_t*
Ok, got it, thanks a lot. Done.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@92
PS6, Line 92:       }
> emplace_back in new code. Elsewhere too.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@791
PS6, Line 791:     vector<ColumnPredicate::BloomFilterInner> orig_bloom_filters 
= *bf;
> Don't need std:: prefix. Below too.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@793
PS6, Line 793:     // NONE
> Nit: maybe name "orig_bloom_filters" to better convey its purpose?
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@828
PS6, Line 828:               ColumnPredicate::IsNotNull(column),
> bf_copy maybe?
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1382
PS6, Line 1382:   BloomFilterBuilder bfb1(
> 0 1 both hit
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1414
PS6, Line 1414:   vector<Slice> keys_slice;
> 0 1 both hit
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1429
PS6, Line 1429:           BloomFilterSizing::ByCountAndFPRate(n_keys, 0.01));
> Each of these "Test for ... type" is independent of the other tests, right?
Actually it is not independent. This Test is for InBloomFilter merge and the 
test is formed by three types tests:uint64, string and binary just like the 
other predicate test, so I think they should be together.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h
File src/kudu/common/column_predicate.h:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@275
PS6, Line 275:
             :     BloomFilterInner() : nhash_(0), hash_algorithm_(CITY_HASH) {}
             :
             :     const Slice& bloom_data() const {
             :       return bloom_data_;
             :     }
             :
             :     size_t nhash() const {
             :       return nhash_;
             :     }
             :
             :     HashAlgorithmInBloomFilter hash_algorithm() const {
             :       return hash_algorithm_;
             :     }
             :
             :     void set_nhash(size_t nhash) {
             :       nhash_ = nhash;
             :     }
             :
             :     void set_bloom_data(Slice bloom_data) {
             :       bloom_data_ = bloom_data;
             :     }
             :
             :     v
> I don't see the point of these getters and setters since the members are al
Emmm I simply try to use setters and getters to let the calls look uniform. 
Maybe I should change the struct to class. Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@301
PS6, Line 301:
             :     bool operator==(const BloomFilterInner& other) const {
> Nit: reformat like:
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@304
PS6, Line 304:               nhash_ == other.nhash() &&
> Nit: separate each member from the member before it (or from operator==) wi
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@308
PS6, Line 308:
             :
> I don't think this comment is necessary; the list of supported algorithms i
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@354
PS6, Line 354:  into this IS NOT NULL
> Nit: const ColumnPredicate&
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@425
PS6, Line 425:
> Nit: "The list of bloom filters in this predicate."
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc
File src/kudu/common/column_predicate.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@68
PS6, Line 68:     : predicate_type_(predicate_type),
            :       column_(move(column)),
            :       lower_(lower),
            :       upper_(upper) {
> Nit: use the same indentation as in the other two constructors.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@113
PS6, Line 113:   CHECK(bfs != nullptr);
> Should check that bfs != nullptr.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@189
PS6, Line 189:   upper_ = nullptr;
> Hmm, why don't we clear values_ in this function? Why is it necessary to cl
Emm, the origin code doesn't clear the values_ and it works well because use 
the predicate type to judge. I conservativly clear the bloom_filters_ actually 
if we don't clear is also ok. Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@226
PS6, Line 226:           SetToNone();
             :         }
             :       }
> Could use SetToNone here.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@254
PS6, Line 254:         return;
> If we allowed an empty list of bloom filters, could that simplify into some
Emmm, I think InBloomFilter predicate must contain as least one bloom filter. 
And allowed an empty list may not simplify the code.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@264
PS6, Line 264:             upper_ = nullptr;
             :             bloom_filters_.clear();
             :           } else {
             :             SetToNone();
             :           }
             :         }
             :       } els
> Can you explain the rationale behind this simplification? It's not obvious
This simplification refer to the Range simplification above. Currently 
InBloomFilter predicate can contain range, so when we simply InBloomFilter 
predicate we should consider both. In this case lower_ and upper_ are 
consecutive which means this predicate may can simplify to Equality if lower_ 
hits the bloomfilters. If lower_ doesn't hit the bloomfilters this predicate 
should be None.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@275
PS6, Line 275:             predicate_type_ = PredicateType::Equality;
> Isn't upper_ guaranteed to be nullptr in this branch?
Yes, I just simply copy the Range simplify part. Should I modify the Range part 
too? Done.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@277
PS6, Line 277:           } else {
             :             SetToNone();
             :           }
             :         }
             :       } else if (upper_ != nullptr) {
             :         if (type_info->IsMinValue(upper_)) {
             :           S
> Also not understanding this simplification; could you explain in a comment?
This simplification refer to the Range simplification above. In this case 
lower_ is the max value which means this predicate may can simplify to Equality 
if lower_ hits the bloomfilters. If lower_ doesn't hit the bloomfilters this 
predicate should be None.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@287
PS6, Line 287:     };
> Why is this case different from the L274-L275 case? That is, why do we use
If the lower_ is the min value means the value can be any value, so L274-L275 
simply set lower_ to be nullptr and keep the bloomfilters intact. But if the 
upper_ is the min vlaue means no value can be selected so the predicate can 
simplify to None.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@443
PS6, Line 443:     };
             :   }
             :   LOG(F
> Can you explain this?
Equality merge InBloomFilter, if the value(lower_) of Equality can not hit the 
InBloomFilter(satisfy both bloomfilters and range bound if contains) then the 
merge result should be None.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@630
PS6, Line 630:         bloom_filters_.clear();
> Nit: "bloom filter", not "list"
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@636
PS6, Line 636:
> in bloom filter
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@647
PS6, Line 647:         }
> Nit: vector<const void*> new_values;
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@650
PS6, Line 650:         SetToNone();
> Nit: emplace_back in new code.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@870
PS6, Line 870: }
> IS IN BloomFilter, perhaps?
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@357
PS6, Line 357:   // Represent a bloom filter.
> What does "quick filter the row" mean? Could you elaborate?
My fault, the comment is not precise. Done.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@362
PS6, Line 362:     optional bytes bloom_data = 2 [(kudu.REDACT) = true];
> Doesn't this depend on the value of hash_algorithm?
Yes, done.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@365
PS6, Line 365:
> Does this need a default value specifier for CITY_HASH?
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@377
PS6, Line 377:
             :     optional bytes lower = 1 [(kudu.
> Nit: not your fault, but could you remove this sentence? We've since added
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@404
PS6, Line 404:     // Lower and Upper is optional for InBloomFilter.
> Nit: add a comment explaining what this is.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@406
PS6, Line 406:     // merge result can be InBloomFilter whith range bound 
inside. And the lower
> How do lower and upper work with bloom filters? Can you doc in a comment?
When use both InBloomFilter and Range predicate for the same column the merge 
result can be InBloomFilter whith range bound inside. And the lower and upper 
works just like they in Range predicate. Done.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@408
PS6, Line 408:     // The inclusive lower bound.
> Use field id 2 here and 3 for 'upper'; no reason to skip 2 since there wasn
My fault. Done.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/key_util.cc
File src/kudu/common/key_util.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/key_util.cc@234
PS6, Line 234:       case PredicateType::InBloomFilter:  // Upper in 
InBloomFilter processed as upper in Range.
> Nit: "upper in InBloomFilter processed as upper in Range"
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc
File src/kudu/common/wire_protocol-test.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@44
PS6, Line 44: #include "kudu/util/test_util.h"
> Not sorted correctly.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@508
PS6, Line 508:   Arena arena(1024);
> WireProtocolTest already has an arena, could reuse it?
The above testcase use new local variable arena such as InList, so I want to 
keep consistent.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@536
PS6, Line 536:     NO_FATALS(ColumnPredicateToPB(ibf, &pb));
> NO_FATALS in new code. Below too.
Ok, Done.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@555
PS6, Line 555: TEST_F(WireProtocolTest, 
TestColumnPredicateBloomFilterWithBound) {
> The setup code is identical in both tests; can you decompose it into a new
The BF test is juxtaposed with InList test above, It is necessary to do this? 
Or I should decompose the predicate test(include the InList) into a new test 
fixture?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc
File src/kudu/common/wire_protocol.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@416
PS6, Line 416:   const void* src = bf_src.bloom_data().data();
             :   size_t size = bf_src.bloom_data().size();
             :   bf_dst->mutable_bloom_data()->assign(reinterpret_cast<const 
char*>(src), size);
             :   bf_dst->set_hash_algorithm(bf_src.
> Combine these four lines into two (i.e. combine declaration and assignment)
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@455
PS6, Line 455:                                       
ColumnPredicate::BloomFilterInner* dst_src,
> from
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@461
PS6, Line 461:   memcpy(data_copy, bf_src.bloom_data().data(), bloom_data_size);
> bf_src.bloom_data().size() is repeated three times; can you store in a loca
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@463
PS6, Line 463:   dst_src->set_hash_algorithm(bf_src.hash_algorithm());
> Why do we need to allocate the slice itself from the arena? Seems like we t
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@511
PS6, Line 511:
> Nit: unnecessary space here.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@514
PS6, Line 514:
> Nit: ColumnPredicatePB::BloomFilter* bloom_filter
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@595
PS6, Line 595:     };
             :     case Colum
> Nit: while you're here, could you fix this indentation?
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@601
PS6, Line 601: m f
> Nit: == 0 is more clear
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@603
PS6, Line 603: ter.bloom_filters()) {
> Nit: no bloom filters
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@608
PS6, Line 608: bloom_filter;
> Nit: "missing bloom filter details"
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@612
PS6, Line 612:       // Extract the optional lower and upper bound.
> emplace_back
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc
File src/kudu/tablet/cfile_set-test.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@21
PS6, Line 21: #include <cstddef>
> Nit: no empty line here.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@109
PS6, Line 109: (2n
> Nit: "(2n)th" so that it doesn't seem like you're trying to say "second". B
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@110
PS6, Line 110:  blo
> Nit: form. Below too.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@113
PS6, Line 113: Builder* bf1_contain,
> Nit: put this one on its own line for better symmetry with the others. Belo
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@213
PS6, Line 213: const shared_ptr<CFileSet>&
> Nit: const shared_ptr<CFileSet>&
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@532
PS6, Line 532:   // BloomFilter of column 0 contain and column 1 contain.
> Nit: terminate with a period.
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h
File src/kudu/util/bloom_filter.h:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@57
PS6, Line 57:   explicit BloomKeyProbe(const Slice &key, 
HashAlgorithmInBloomFilter hash_algorithm = CITY_HASH)
> Can you rewrite this constructor to chain to the new one?
I chose to use constructor with default value. I can also simply delete the 
origin constructor and modify the calls of origin constructor.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@68
PS6, Line 68:       default:
> The explicit keyword is only for constructors with one argument.
Done.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@69
PS6, Line 69:         h = util_hash::CityHash64(
> Reformat like this:
Done


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@76
PS6, Line 76:     h_2_ = static_cast<uint32_t>(h >> 32);
> Can you reformat:
Done



--
To view, visit http://gerrit.cloudera.org:8080/11100
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62c2de42667d0255d94e19db773240f7f9ee636c
Gerrit-Change-Number: 11100
Gerrit-PatchSet: 8
Gerrit-Owner: ZhangYao <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: ZhangYao <[email protected]>
Gerrit-Comment-Date: Mon, 24 Sep 2018 09:22:18 +0000
Gerrit-HasComments: Yes

Reply via email to