Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7793 )
Change subject: IMPALA-4252: Min-max runtime filters for Kudu ...................................................................... Patch Set 6: (17 comments) http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG@15 PS6, Line 15: them into the Kudu client. Because the Kudu client doesn't provide a Is there any documentation for Kudu about what ordering Kudu uses for comparisons for different types? I guess we already push filters so we're already depending on the orderings being the same as Impala, but I'd be interested to know. We had a similar discussion about Parquet and the spec got clarified a lot as a result (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md now specifies it). Parquet-MR also had various odd behaviour that got shaken out as a result of this. http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h File be/src/exec/filter-context.h: http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@106 PS6, Line 106: bloom_filter has_bloom_filter() ? At callsites it reads like this should return a filter rather than a bool. http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@109 PS6, Line 109: min_max_filter has_min_max_filter? http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc@1220 PS6, Line 1220: MinMaxFilter::Or(src_type(), params.min_max_filter, min_max_filter_.get()); I mentioned this in a later comment, but we should consider what happens if the min/max values are very large - maybe thre's some memory management similar to above. http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc File be/src/runtime/runtime-filter-bank.cc: http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc@235 PS6, Line 235: MinMaxFilter* min_max_filter = MinMaxFilter::Create(type, &obj_pool_); (nit) - could combine into one line. http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc File be/src/runtime/runtime-filter-ir.cc: http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc@27 PS6, Line 27: // to a) the atomicity of / pointer assignments and b) the x86 TSO memory model. Comment not relevant any more since we're using actual atomics? http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc File be/src/util/min-max-filter-ir.cc: http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc@53 PS6, Line 53: min_str = string(value->ptr, value->len); We should think through the behaviour with very large strings. It may make sense to disable the filters if the strings are too large instead of allocating arbitrarily large amounts of memory. Maybe we could also do some trickery with truncating the strings to avoid handling the filters not existing. E.g. if we have a 4 byte limit on min/max, replace min="aaaaaaa" with min="aaaa" and max="zzzzzz" with max="zzz(". Interested to hear what you think - might be too clever but seems like it could work. It would be good to also allocate tracked memory for the string. The Parquet ColumnStats class solves a similar problem using StringBuffers that allocate from a MemPool. http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h File be/src/util/min-max-filter.h: http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h@83 PS6, Line 83: #define NUMERIC_MIN_MAX_FILTER(NAME, TYPE) \ Yeah, the macro seems like it may be the least of the evils. http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc File be/src/util/min-max-filter.cc: http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@77 PS6, Line 77: std::string NAME##MinMaxFilter::DebugString() const { \ nit: don't need std:: prefix in .cc files. http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@139 PS6, Line 139: out->min.__set_string_val(std::min(in.min.string_val, out->min.string_val)); I feel like we should use our StringValue::Compare() function instead of relying on std::string's comparison function. I'm not sure if they are exactly the same but if we compare these the same way as we compare StringValue's it's easier to reason about. http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@291 PS6, Line 291: BigIntMinMaxFilter::Or(in, out); Shouldn't this be calling the timestamp method? http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@327 PS6, Line 327: } Timestamp? http://gerrit.cloudera.org:8080/#/c/7793/6/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java: http://gerrit.cloudera.org:8080/#/c/7793/6/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@379 PS6, Line 379: if (slotRef == null || slotRef.getDesc().getColumn() == null Are there planner tests for all these cases? http://gerrit.cloudera.org:8080/#/c/7793/6/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test File testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test: http://gerrit.cloudera.org:8080/#/c/7793/6/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test@30 PS6, Line 30: kudu Can we make this something like table_format=kudu so that it's a bit more self-describing and future-proof? http://gerrit.cloudera.org:8080/#/c/7793/6/tests/query_test/test_runtime_filters.py File tests/query_test/test_runtime_filters.py: http://gerrit.cloudera.org:8080/#/c/7793/6/tests/query_test/test_runtime_filters.py@61 PS6, Line 61: lambda v: v.get_value('table_format').file_format not in ['hbase', 'kudu']) nit: 4 character indents on line continuations (I guess this is preserved from the existing code). http://gerrit.cloudera.org:8080/#/c/7793/6/tests/util/test_file_parser.py File tests/util/test_file_parser.py: http://gerrit.cloudera.org:8080/#/c/7793/6/tests/util/test_file_parser.py@252 PS6, Line 252: allowed_formats = ['kudu'] Thanks for adding this validation, hopefully will lead people in the right direction if they need to generalise this. http://gerrit.cloudera.org:8080/#/c/7793/6/tests/util/test_file_parser.py@254 PS6, Line 254: hint hint? Isn't it a table format? -- To view, visit http://gerrit.cloudera.org:8080/7793 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c Gerrit-Change-Number: 7793 Gerrit-PatchSet: 6 Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Thu, 12 Oct 2017 00:13:34 +0000 Gerrit-HasComments: Yes
