Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics ......................................................................
Patch Set 6: (14 comments) Thanks for the review. Please see PS7. http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 382: /// Min/max statistics contexts, owned by instances of this class. > they are not owned by the class instances themselves, they're allocated in Done Line 386: /// 'min_max_conjunct_ctxs_', which are used when evaluating parquet::Statistics. > somewhat incomprehensible comment. Done Line 476: Status EvaluateRowGroupStats(const parquet::RowGroup& row_group, bool* skip_row_group); > you're not evaluating stats, you're evaluating predicates on the stats Done http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 31: enum class StatsField { MIN, MAX }; > move into ColumnStatsBase so it's clear what stats this is referring to. The other two values are i64 null_count and i64 distinct_count, for which I think we should have proper getters, i.e. we don't need to consider their type and method of decoding when reading them. Should we prefer an enum-based getter for those over ColumnStatsBase::ReadNullCountFromThrift() and ReadDistinctCountFromThrift()? Line 114: static bool ReadFromThrift(const parquet::Statistics& thrift_stats, const StatsField& stats_field, > move function bodies into parquet-column-stats-inl.h Done http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 297: collectionConjuncts_.clear(); > true, but those should be identical, given that we rewrite constants as lit Constant rewriting is controlled by the enable_expr_rewrites query option, changing the code to filter on literals and turning the query option off disables statistics for predicates like "a < 3 - 3". There is a test for this in parquet_stats.test:L206 that would fail. Let me know if you prefer to switch to literals. http://gerrit.cloudera.org:8080/#/c/6032/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 85: * conjuncts, that are passed to the backend and will be evaluated against the > "conjuncts that" Done Line 154: private List<Expr> minMaxPlanConjuncts_ = Lists.newArrayList(); > "plan conjuncts" doesn't really mean anything. original conjuncts? Done Line 302: * Builds a predicate to evaluate parquet::Statistics by copying 'inputSlot' into > "to evaluate against" Done Line 347: // Only constant exprs can be evaluated against the parquet::Statistics. This > drop "the" Done Line 349: if (!constExpr.isConstant()) continue; > as pointed out before, you shouldn't see non-literal constants here See my reply in PS4, constant folding depends on the enable_expr_rewrites query option. http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test: Line 1047: runtime filters: RF000 -> c_custkey > extended is fine. Done http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test File testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test: Line 1: # Test HDFS scan node. > run this file in extended mode so you see the stats predicates. Done http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test: Line 206: # Test that without expr rewrite and thus without constant folding, constant exprs still > what's the point of those tests? See my comment regarding literal vs. constant exprs in PS4. These tests make sure that we don't break support for constant exprs. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Matthew Mulder <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-HasComments: Yes
