Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
......................................................................
Patch Set 4: (26 comments)

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 472: ScopedBuffer tuple_buffer(scan_node_->mem_tracker());
No need to allocate this and go through the setup for every single row group.

Line 486: conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size());
There's too much memory allocation going on for what are essentially static structures.

Line 496: unique_ptr<ColumnStatsBase> stats(
There's no need for memory allocation here.

Line 501: if (e->function_name() == "lt" || e->function_name() == "le") {
Introduce named constants (in the appropriate class).

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 378: /// Min/max statistics contexts.
Who owns them?

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 182: if (min_max_tuple_id_ > -1) {
">" or "<" don't make sense for tuple ids.

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 40: /// writing parquet files. It provides an interface to populate a parquet::Statistics
This description isn't true anymore; it's now also an intermediate staging area for the min/max values. I'm not sure this expanded abstraction makes sense (why do you need an intermediate staging area?).

Line 70: /// 'nullptr' for unsupported types. The caller is responsible for freeing the memory.
This should be created in an ObjectPool.

Line 87: virtual void WriteMinToBuffer(void* buffer) const = 0;
Remove "ToBuffer"; that's clear from the parameters. What's unclear is the format of the written data. Is the void* simply a slot? 'buffer' is too generic.

Line 126: max_value_(max_value) {
Formatting. It looks like you're populating an object of this class and then writing out the min/max to the corresponding slots.
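The scanner comments above concern evaluating min/max conjuncts against per-row-group statistics. As a minimal, self-contained sketch of that pruning idea (in Java for illustration; class and method names are hypothetical, not Impala's actual C++ ColumnStats classes — only the "lt"/"le"-style operator names come from the code under review): a row group can be skipped when a conjunct like `col <= 16` cannot be satisfied by any value in the `[min, max]` range recorded in its statistics.

```java
// Minimal sketch of min/max row-group pruning (hypothetical names).
final class RowGroupStats {
  final long min;
  final long max;
  RowGroupStats(long min, long max) { this.min = min; this.max = max; }
}

final class MinMaxFilter {
  // Returns true if the row group may contain rows matching "col op value"
  // and therefore must be scanned; false means it can be skipped safely.
  static boolean mayMatch(RowGroupStats stats, String op, long value) {
    switch (op) {
      case "lt": return stats.min < value;   // any value below 'value'?
      case "le": return stats.min <= value;
      case "gt": return stats.max > value;   // any value above 'value'?
      case "ge": return stats.max >= value;
      case "eq": return stats.min <= value && value <= stats.max;
      default:   return true;                // unknown op: never skip
    }
  }
}
```

For a row group with min=20 and max=90, the conjunct `col <= 16` gives `mayMatch(stats, "le", 16) == false`, so the whole group is skipped without reading any data pages. Note the conservative default: an operator the filter doesn't understand must never cause a skip.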
Why not go directly from parquet::Statistics to the slots?

http://gerrit.cloudera.org:8080/#/c/6032/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 208: // Conjuncts that can be evaluated against the statistics of parquet files. Each
Refer to Thrift structs explicitly, where appropriate.

Line 209: // conjunct references the slot in 'min_max_tuple_id' with the same index.
I didn't understand the last sentence.

http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 143: // List of conjuncts for min/max values, that are used to skip data when scanning
"Min/max values" is vague; refer to Thrift structs explicitly.

Line 145: private List<Expr> minMaxConjunts_ = Lists.newArrayList();
conjuncts

Line 147: // Tuple that is used to materialize statistics when scanning Parquet files.
Describe what's in it. I should have an understanding of the overall approach just based on reading the comments for the related data structures. This might best fit into the class header.

Line 290: * Builds a predicate to evaluate statistical values by combining 'inputSlot',
Make 'statistical values' concrete.

Line 291: * 'inputPred' and 'op' into a new predicate, and adding it to 'minMaxConjunts_'.
Missing other side effects.

Line 297: Preconditions.checkState(constExpr.isConstant());
Why not a literal?

Line 303: // Make a new slot from the slot descriptor.
What do you mean by 'slot'? (Slot/slot descriptor are usually synonyms.)

Line 315: * TODO: This method currently doesn't de-dup slots that are referenced multiple times.
Resolve.

Line 328: // We only support slot refs on the left hand side of the predicate, a rewriting
This overlaps with what KuduScanNode does, and there's already a similar concept in the form of BinaryPredicate.getBoundSlot. Figure out the similarity (might require a new concept) and move the functionality into BinaryPredicate.
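The normalization the last comment alludes to can be sketched as follows. This is a minimal, self-contained illustration (the `Op` and `Pred` types are hypothetical, not Impala's actual Expr/BinaryPredicate API): rewriting `<constant> op <slot>` into `<slot> op' <constant>` requires flipping the operator to its converse when the operands are swapped, since `16 >= c` and `c <= 16` are the same predicate.

```java
// Sketch of binary-predicate normalization: if the constant is on the
// left ("16 >= c_nationkey"), swap the operands and flip the operator
// to its converse ("c_nationkey <= 16"). Names are hypothetical.
enum Op {
  LT, LE, GT, GE, EQ, NE;

  // The operator that preserves the predicate's meaning when operands swap.
  Op converse() {
    switch (this) {
      case LT: return GT;
      case LE: return GE;
      case GT: return LT;
      case GE: return LE;
      default: return this;  // EQ and NE are symmetric.
    }
  }
}

final class Pred {
  final String lhs;        // operand, e.g. a slot ref or a literal
  final Op op;
  final String rhs;
  final boolean lhsIsSlot; // does the slot ref sit on the left already?

  Pred(String lhs, Op op, String rhs, boolean lhsIsSlot) {
    this.lhs = lhs; this.op = op; this.rhs = rhs; this.lhsIsSlot = lhsIsSlot;
  }

  // Normalize so the slot reference ends up on the left-hand side.
  Pred normalize() {
    if (lhsIsSlot) return this;
    return new Pred(rhs, op.converse(), lhs, true);
  }

  @Override public String toString() { return lhs + " " + op + " " + rhs; }
}
```

With this shape, `new Pred("16", Op.GE, "c_nationkey", false).normalize()` yields `c_nationkey LE 16`, which is the slot-on-the-left form that downstream code (min/max conjunct building, bound-slot extraction) can consume uniformly.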
Line 343: {
"{" on preceding line.

Line 658: msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt());
Check that it's not invalid.

http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
File fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java:

Line 35: public class NormalizeBinaryPredicatesRule implements ExprRewriteRule {
Add tests for the rule.

Line 40: if (!expr.isAnalyzed()) return expr;
Why do you need this?

http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

Line 1047: statistics predicates: c_nationkey <= 16, c_nationkey >= 16
I think it's better to retain the original predicate (you don't need to show me here that you know how to transform it correctly against the min/max tuple). Also, this should only be printed for verbose explain output.

--
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-HasComments: Yes
