Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
......................................................................


Patch Set 6:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG@15
PS6, Line 15: them into the Kudu client. Because the Kudu client doesn't 
provide a
Is there any documentation for Kudu about what ordering Kudu uses for 
comparisons for different types? I guess we already push filters so we're 
already depending on the orderings being the same as Impala, but I'd be 
interested to know.

We had a similar discussion about Parquet and the spec got clarified a lot as a 
result (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md 
now specifies it). Parquet-MR also had various odd behaviour that got shaken 
out as a result of this.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@106
PS6, Line 106: bloom_filter
has_bloom_filter() ? At callsites it reads like this should return a filter 
rather than a bool.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@109
PS6, Line 109: min_max_filter
has_min_max_filter?


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc@1220
PS6, Line 1220:       MinMaxFilter::Or(src_type(), params.min_max_filter, 
min_max_filter_.get());
I mentioned this in a later comment, but we should consider what happens if the 
min/max values are very large - maybe thre's some memory management similar to 
above.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc
File be/src/runtime/runtime-filter-bank.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc@235
PS6, Line 235:   MinMaxFilter* min_max_filter = MinMaxFilter::Create(type, 
&obj_pool_);
(nit) - could combine into one line.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc
File be/src/runtime/runtime-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc@27
PS6, Line 27:   // to a) the atomicity of / pointer assignments and b) the x86 
TSO memory model.
Comment not relevant any more since we're using actual atomics?


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc
File be/src/util/min-max-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc@53
PS6, Line 53:       min_str = string(value->ptr, value->len);
We should think through the behaviour with very large strings. It may make 
sense to disable the filters if the strings are too large instead of allocating 
arbitrarily large amounts of memory.

Maybe we could also do some trickery with truncating the strings to avoid 
handling the filters not existing. E.g. if we have a 4 byte limit on min/max, 
replace min="aaaaaaa" with min="aaaa" and max="zzzzzz" with max="zzz(". 
Interested to hear what you think - might be too clever but seems like it could 
work.

It would be good to also allocate tracked memory for the string. The Parquet 
ColumnStats class solves a similar problem using StringBuffers that allocate 
from a MemPool.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h
File be/src/util/min-max-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h@83
PS6, Line 83: #define NUMERIC_MIN_MAX_FILTER(NAME, TYPE)                        
 \
Yeah, the macro seems like it may be the least of the evils.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@77
PS6, Line 77:   std::string NAME##MinMaxFilter::DebugString() const {           
             \
nit: don't need std:: prefix in .cc files.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@139
PS6, Line 139:       out->min.__set_string_val(std::min(in.min.string_val, 
out->min.string_val));
I feel like we should use our StringValue::Compare() function instead of 
relying on std::string's comparison function. I'm not sure if they are exactly 
the same but if we compare these the same way as we compare StringValue's it's 
easier to reason about.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@291
PS6, Line 291:       BigIntMinMaxFilter::Or(in, out);
Shouldn't this be calling the timestamp method?


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@327
PS6, Line 327:   }
Timestamp?


http://gerrit.cloudera.org:8080/#/c/7793/6/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/6/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@379
PS6, Line 379:         if (slotRef == null || slotRef.getDesc().getColumn() == 
null
Are there planner tests for all these cases?


http://gerrit.cloudera.org:8080/#/c/7793/6/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
File testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/6/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test@30
PS6, Line 30: kudu
Can we make this something like table_format=kudu so that it's a bit more 
self-describing and future-proof?


http://gerrit.cloudera.org:8080/#/c/7793/6/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/7793/6/tests/query_test/test_runtime_filters.py@61
PS6, Line 61:       lambda v: v.get_value('table_format').file_format not in 
['hbase', 'kudu'])
nit: 4 character indents on line continuations (I guess this is preserved from 
the existing code).


http://gerrit.cloudera.org:8080/#/c/7793/6/tests/util/test_file_parser.py
File tests/util/test_file_parser.py:

http://gerrit.cloudera.org:8080/#/c/7793/6/tests/util/test_file_parser.py@252
PS6, Line 252:           allowed_formats = ['kudu']
Thanks for adding this validation, hopefully will lead people in the right 
direction if they need to generalise this.


http://gerrit.cloudera.org:8080/#/c/7793/6/tests/util/test_file_parser.py@254
PS6, Line 254: hint
hint? Isn't it a table format?



--
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Thu, 12 Oct 2017 00:13:34 +0000
Gerrit-HasComments: Yes

Reply via email to