Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7793 )
Change subject: IMPALA-4252: Min-max runtime filters for Kudu ...................................................................... Patch Set 5: (33 comments) I did a pass over the backend part of it. Haven't gotten into all the nitty-gritty yet. The overall approach seems to make sense, had some thoughts on the classes and various parts of the implementation. http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG@9 PS5, Line 9: This patch implements min-max filters for runtime filters. Each I had a general terminology question that popped up as I read through. Does "runtime filter" mean a composition of a bloom filter and min-max filter, or does "runtime filter" mean either a bloom filter or a min-max filter. http://gerrit.cloudera.org:8080/#/c/7793/5//COMMIT_MSG@30 PS5, Line 30: - Still needs more e2e tests. I'm sure you have more tests in mind. I was thinking it would be important to test filters on every supported data type. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h File be/src/exec/filter-context.h: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h@80 PS5, Line 80: /// FilterContext contains all metadata for a single runtime filter, and allows the filter This seems to be an example of a place where "runtime filter" could potential mean the composition of a bloom and min-max filter. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.h@107 PS5, Line 107: /// Returns true if this FilterContext should build a min-max filter, if possible. Can you add a blank line between methods? http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc File be/src/exec/filter-context.cc: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc@422 PS5, Line 422: if (filter_expr->type().type != PrimitiveType::TYPE_DECIMAL) { Why is decimal special here? http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/filter-context.cc@434 PS5, Line 434: switch (filter_expr->type().type) { Can we make this a lookup table? Seems like the only difference between the cases is an enum value and a string value. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/hdfs-parquet-scanner.cc@401 PS5, Line 401: if (filter->BloomAlwaysTrue() Can we phrase the method in a way that doesn't depend on the assumption that Parquet only handles bloom filters (since this method shouldn't really care about the internals of the filter). Does filter->AlwaysTrue() make sense? Because if there's no min-max filter and the bloom filter is always true, then the filter as a whole is always true. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc File be/src/exec/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@174 PS5, Line 174: // TODO: if the filter is always false, we can skip this entire scan. How important is this and is it blocked by anything else? If it's impactful might be best to just get it done and other of the way. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@175 PS5, Line 175: NULL nullptr http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@177 PS5, Line 177: dynamic_cast dynamic_cast seems unnecessary since the type should always be correct - can we just use static_cast? http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-scanner.cc@179 PS5, Line 179: string const string&? http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-util.cc File be/src/exec/kudu-util.cc: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/kudu-util.cc@231 PS5, Line 231: const TimestampValue* tv = reinterpret_cast<const TimestampValue*>(value); This timestamp code is a bit subtle. Is there a sane way to combine the conversion logic with WriteKuduTimestampValue() above? http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/partitioned-hash-join-builder.cc File be/src/exec/partitioned-hash-join-builder.cc: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/exec/partitioned-hash-join-builder.cc@548 PS5, Line 548: num_enabled_filters Is this meant to double-count the filters if there is a bloom and min-max filter on the same column? I'm not sure what the right thing to do here is but it does seem based on the class structure that we consider a matching pair of bloom filter + min max filter to be a single "runtime filter". http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/coordinator-filter-state.h File be/src/runtime/coordinator-filter-state.h: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/coordinator-filter-state.h@92 PS5, Line 92: aall all http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/coordinator-filter-state.h@131 PS5, Line 131: bool bloom_disabled_; Blank line? http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.h File be/src/runtime/runtime-filter-bank.h: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.h@74 PS5, Line 74: /// Updates a filter's bloom_filter and min_max_filter which has been produced by some nit: quote 'bloom_filter' and 'min_max_filter' for consistency. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.cc File be/src/runtime/runtime-filter-bank.cc: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.cc@202 PS5, Line 202: obj_pool_.Add(MinMaxFilter::Create(params.min_max_filter, it->second->type())); In other places we've designed APIs so that the ObjectPool is passed into Create(). I think something like that makes the memory management contract between caller and callee more obvious and less error-prone so it may be worth doing. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter-bank.cc@230 PS5, Line 230: NULL nullptr? http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h File be/src/runtime/runtime-filter.h: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@46 PS5, Line 46: registration_time_ = MonotonicMillis(); Not your code, but seems like this should be in the initialiser list. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@94 PS5, Line 94: private: The thread safety around these members is really confusing. I have a few suggestions to clarify it. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@99 PS5, Line 99: BloomFilter* bloom_filter_; I think this should be an AtomicPtr http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@102 PS5, Line 102: MinMaxFilter* min_max_filter_; I think this should be an AtomicPtr http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@108 PS5, Line 108: int64_t registration_time_; This doesn't change so let's make it const. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@110 PS5, Line 110: /// Time, in ms, that the global filter arrived. Set in SetFilters(). I think this should be an AtomicInt. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.h@114 PS5, Line 114: int64_t filter_size_; This doesn't change so let's make it const. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.inline.h File be/src/runtime/runtime-filter.inline.h: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.inline.h@41 PS5, Line 41: DCHECK(bloom_filter_ == NULL && min_max_filter_ == NULL); nullptr? http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/runtime/runtime-filter.inline.h@42 PS5, Line 42: // TODO: Barrier required here to ensure compiler does not both inline and re-order This TODO is really confusing. I don't understand the reasoning. IMO we should just make the below member variables atomics and remove this comment. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h File be/src/util/min-max-filter.h: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@35 PS5, Line 35: struct MinMaxFilter { It might be better to make these proper classes and hide the internals? Although I could be missing the motivation. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@38 PS5, Line 38: virtual void* get_min() = 0; It's worth documenting in the class comment what representation is used (I assume it's the internal representation, same as tuple slots and the RawValue API). http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@91 PS5, Line 91: virtual void ToThrift(TMinMaxFilter* thrift) const; I find it helpful to add override annotations to overridden methods for documentation and also some compile-time enforcement that the code is doing what it looks like it's doing. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@122 PS5, Line 122: virtual void* get_min() { return &min; } Is it valid to call these methods if no values have been added? http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.h@126 PS5, Line 126: IR_ALWAYS_INLINE I don't think adding IR_ALWAYS_INLINE to a virtual method will do anything in most cases since the compiler doesn't down which implementation is called at a given callsite. I think the exception is if you specify the version, e.g. StringMinMaxFilter::Insert(), or *maybe* if you have a "final" annotation: http://en.cppreference.com/w/cpp/language/final. I think we should either remove this or clarify why it's useful. http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.cc File be/src/util/min-max-filter.cc: http://gerrit.cloudera.org:8080/#/c/7793/5/be/src/util/min-max-filter.cc@118 PS5, Line 118: "struct.impala::NumericMinMaxFilter.918"; I don't think we should rely on the uniquifying numbers produced by LLVM, seems likely to change arbitrarily. It might be easiest to just create named subclasses inheriting from the templated versions. -- To view, visit http://gerrit.cloudera.org:8080/7793 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c Gerrit-Change-Number: 7793 Gerrit-PatchSet: 5 Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Thu, 28 Sep 2017 20:59:00 +0000 Gerrit-HasComments: Yes
