[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

Tim Armstrong (Code Review) Thu, 28 Sep 2017 13:59:41 -0700

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )


Change subject: IMPALA-4252: Min-max runtime filters for Kudu
......................................................................


Patch Set 5:

(33 comments)

I did a pass over the backend part of it. Haven't gotten into all the 
nitty-gritty yet. The overall approach seems to make sense, had some thoughts 
on the classes and various parts of the implementation.

http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG@9
PS5, Line 9: This patch implements min-max filters for runtime filters. Each
I had a general terminology question that popped up as I read through. Does 
"runtime filter" mean a composition of a bloom filter and min-max filter, or 
does "runtime filter" mean either a bloom filter or a min-max filter.


http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG@30
PS5, Line 30: - Still needs more e2e tests.
I'm sure you have more tests in mind. I was thinking it would be important to 
test filters on every supported data type.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h@80
PS5, Line 80: /// FilterContext contains all metadata for a single runtime 
filter, and allows the filter
This seems to be an example of a place where "runtime filter" could potential 
mean the composition of a bloom and min-max filter.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h@107
PS5, Line 107:   /// Returns true if this FilterContext should build a min-max 
filter, if possible.
Can you add a blank line between methods?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc@422
PS5, Line 422:   if (filter_expr->type().type != PrimitiveType::TYPE_DECIMAL) {
Why is decimal special here?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc@434
PS5, Line 434:     switch (filter_expr->type().type) {
Can we make this a lookup table? Seems like the only difference between the 
cases is an enum value and a string value.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/hdfs-parquet-scanner.cc@401
PS5, Line 401:     if (filter->BloomAlwaysTrue()
Can we phrase the method in a way that doesn't depend on the assumption that 
Parquet only handles bloom filters (since this method shouldn't really care 
about the internals of the filter). Does filter->AlwaysTrue() make sense? 
Because if there's no min-max filter and the bloom filter is always true, then 
the filter as a whole is always true.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@174
PS5, Line 174:       // TODO: if the filter is always false, we can skip this 
entire scan.
How important is this and is it blocked by anything else? If it's impactful 
might be best to just get it done and other of the way.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@175
PS5, Line 175: NULL
nullptr


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@177
PS5, Line 177: dynamic_cast
dynamic_cast seems unnecessary since the type should always be correct - can we 
just use static_cast?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@179
PS5, Line 179: string
const string&?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-util.cc
File be/src/exec/kudu-util.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-util.cc@231
PS5, Line 231:       const TimestampValue* tv = reinterpret_cast<const 
TimestampValue*>(value);
This timestamp code is a bit subtle. Is there a sane way to combine the 
conversion logic with WriteKuduTimestampValue() above?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/partitioned-hash-join-builder.cc@548
PS5, Line 548: num_enabled_filters
Is this meant to double-count the filters if there is a bloom and min-max 
filter on the same column? I'm not sure what the right thing to do here is but 
it does seem based on the class structure that we consider a matching pair of 
bloom filter + min max filter to be a single "runtime filter".


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/coordinator-filter-state.h
File be/src/runtime/coordinator-filter-state.h:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/coordinator-filter-state.h@92
PS5, Line 92: aall
all


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/coordinator-filter-state.h@131
PS5, Line 131:   bool bloom_disabled_;
Blank line?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.h
File be/src/runtime/runtime-filter-bank.h:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.h@74
PS5, Line 74:   /// Updates a filter's bloom_filter and min_max_filter which 
has been produced by some
nit: quote 'bloom_filter' and 'min_max_filter' for consistency.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.cc
File be/src/runtime/runtime-filter-bank.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.cc@202
PS5, Line 202:         
obj_pool_.Add(MinMaxFilter::Create(params.min_max_filter, it->second->type()));
In other places we've designed APIs so that the ObjectPool is passed into 
Create(). I think something like that makes the memory management contract 
between caller and callee more obvious and less error-prone so it may be worth 
doing.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.cc@230
PS5, Line 230: NULL
nullptr?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h
File be/src/runtime/runtime-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@46
PS5, Line 46:     registration_time_ = MonotonicMillis();
Not your code, but seems like this should be in the initialiser list.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@94
PS5, Line 94:  private:
The thread safety around these members is really confusing. I have a few 
suggestions to clarify it.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@99
PS5, Line 99:   BloomFilter* bloom_filter_;
I think this should be an AtomicPtr


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@102
PS5, Line 102:   MinMaxFilter* min_max_filter_;
I think this should be an AtomicPtr


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@108
PS5, Line 108:   int64_t registration_time_;
This doesn't change so let's make it const.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@110
PS5, Line 110:   /// Time, in ms, that the global filter arrived. Set in 
SetFilters().
I think this should be an AtomicInt.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@114
PS5, Line 114:   int64_t filter_size_;
This doesn't change so let's make it const.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.inline.h
File be/src/runtime/runtime-filter.inline.h:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.inline.h@41
PS5, Line 41:   DCHECK(bloom_filter_ == NULL && min_max_filter_ == NULL);
nullptr?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.inline.h@42
PS5, Line 42:   // TODO: Barrier required here to ensure compiler does not both 
inline and re-order
This TODO is really confusing. I don't understand the reasoning. IMO we should 
just make the below member variables atomics and remove this comment.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h
File be/src/util/min-max-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@35
PS5, Line 35: struct MinMaxFilter {
It might be better to make these proper classes and hide the internals? 
Although I could be missing the motivation.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@38
PS5, Line 38:   virtual void* get_min() = 0;
It's worth documenting in the class comment what representation is used (I 
assume it's the internal representation, same as tuple slots and the RawValue 
API).


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@91
PS5, Line 91:   virtual void ToThrift(TMinMaxFilter* thrift) const;
I find it helpful to add override annotations to overridden methods for 
documentation and also some compile-time enforcement that the code is doing 
what it looks like it's doing.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@122
PS5, Line 122:   virtual void* get_min() { return &min; }
Is it valid to call these methods if no values have been added?


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@126
PS5, Line 126: IR_ALWAYS_INLINE
I don't think adding IR_ALWAYS_INLINE to a virtual method will do anything in 
most cases since the compiler doesn't down which implementation is called at a 
given callsite. I think the exception is if you specify the version, e.g. 
StringMinMaxFilter::Insert(), or *maybe* if you have a "final" annotation: 
http://en.cppreference.com/w/cpp/language/final.

I think we should either remove this or clarify why it's useful.


http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.cc@118
PS5, Line 118:     "struct.impala::NumericMinMaxFilter.918";
I don't think we should rely on the uniquifying numbers produced by LLVM, seems 
likely to change arbitrarily. It might be easiest to just create named 
subclasses inheriting from the templated versions.



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Thu, 28 Sep 2017 20:59:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu

Reply via email to