Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
......................................................................
Patch Set 4: (26 comments)

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 472: ScopedBuffer tuple_buffer(scan_node_->mem_tracker());
No need to allocate this and go through the setup for every single row group.

Line 486: conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size());
There's too much memory allocation going on for what are essentially static structures.

Line 496: unique_ptr<ColumnStatsBase> stats(
There's no need for memory allocation here.

Line 501: if (e->function_name() == "lt" || e->function_name() == "le") {
Introduce named constants (in the appropriate class).

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 378: /// Min/max statistics contexts.
Who owns them?

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 182: if (min_max_tuple_id_ > -1) {
">" or "<" don't make sense for tuple ids.

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 40: /// writing parquet files. It provides an interface to populate a parquet::Statistics
This description isn't true anymore; it's now also an intermediate staging area for the min/max values. I'm not sure this expanded abstraction makes sense (why do you need an intermediate staging area?).

Line 70: /// 'nullptr' for unsupported types. The caller is responsible for freeing the memory.
This should be created in an ObjectPool.

Line 87: virtual void WriteMinToBuffer(void* buffer) const = 0;
Remove "ToBuffer"; that's clear from the parameters. What's unclear is the format of the written data. Is the void* simply a slot? 'buffer' is too generic.

Line 126: max_value_(max_value) {
Formatting. It looks like you're populating an object of this class and then writing out the min/max to the corresponding slots.
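The scanner comments above concern evaluating min/max conjuncts against per-row-group statistics. As a minimal, self-contained sketch of that pruning idea (in Java for illustration; class and method names are hypothetical, not Impala's actual C++ ColumnStats classes — only the "lt"/"le"-style operator names come from the code under review): a row group can be skipped when a conjunct like `col <= 16` cannot be satisfied by any value in the `[min, max]` range recorded in its statistics.

```java
// Minimal sketch of min/max row-group pruning (hypothetical names).
final class RowGroupStats {
  final long min;
  final long max;
  RowGroupStats(long min, long max) { this.min = min; this.max = max; }
}

final class MinMaxFilter {
  // Returns true if the row group may contain rows matching "col op value"
  // and therefore must be scanned; false means it can be skipped safely.
  static boolean mayMatch(RowGroupStats stats, String op, long value) {
    switch (op) {
      case "lt": return stats.min < value;   // any value below 'value'?
      case "le": return stats.min <= value;
      case "gt": return stats.max > value;   // any value above 'value'?
      case "ge": return stats.max >= value;
      case "eq": return stats.min <= value && value <= stats.max;
      default:   return true;                // unknown op: never skip
    }
  }
}
```

For a row group with min=20 and max=90, the conjunct `col <= 16` gives `mayMatch(stats, "le", 16) == false`, so the whole group is skipped without reading any data pages. Note the conservative default: an operator the filter doesn't understand must never cause a skip.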
Why not go directly from parquet::Statistics to the slots?

http://gerrit.cloudera.org:8080/#/c/6032/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 208: // Conjuncts that can be evaluated against the statistics of parquet files. Each
Refer to Thrift structs explicitly, where appropriate.

Line 209: // conjunct references the slot in 'min_max_tuple_id' with the same index.
I didn't understand the last sentence.

http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 143: // List of conjuncts for min/max values, that are used to skip data when scanning
"Min/max values" is vague; refer to Thrift structs explicitly.

Line 145: private List<Expr> minMaxConjunts_ = Lists.newArrayList();
conjuncts

Line 147: // Tuple that is used to materialize statistics when scanning Parquet files.
Describe what's in it. I should have an understanding of the overall approach just based on reading the comments for the related data structures. This might best fit into the class header.

Line 290: * Builds a predicate to evaluate statistical values by combining 'inputSlot',
Make 'statistical values' concrete.

Line 291: * 'inputPred' and 'op' into a new predicate, and adding it to 'minMaxConjunts_'.
Missing other side effects.

Line 297: Preconditions.checkState(constExpr.isConstant());
Why not a literal?

Line 303: // Make a new slot from the slot descriptor.
What do you mean by 'slot'? (Slot/slot descriptor are usually synonyms.)

Line 315: * TODO: This method currently doesn't de-dup slots that are referenced multiple times.
Resolve.

Line 328: // We only support slot refs on the left hand side of the predicate, a rewriting
This overlaps with what KuduScanNode does, and there's already a similar concept in the form of BinaryPredicate.getBoundSlot. Figure out the similarity (might require a new concept) and move the functionality into BinaryPredicate.
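The normalization the last comment alludes to can be sketched as follows. This is a minimal, self-contained illustration (the `Op` and `Pred` types are hypothetical, not Impala's actual Expr/BinaryPredicate API): rewriting `<constant> op <slot>` into `<slot> op' <constant>` requires flipping the operator to its converse when the operands are swapped, since `16 >= c` and `c <= 16` are the same predicate.

```java
// Sketch of binary-predicate normalization: if the constant is on the
// left ("16 >= c_nationkey"), swap the operands and flip the operator
// to its converse ("c_nationkey <= 16"). Names are hypothetical.
enum Op {
  LT, LE, GT, GE, EQ, NE;

  // The operator that preserves the predicate's meaning when operands swap.
  Op converse() {
    switch (this) {
      case LT: return GT;
      case LE: return GE;
      case GT: return LT;
      case GE: return LE;
      default: return this;  // EQ and NE are symmetric.
    }
  }
}

final class Pred {
  final String lhs;        // operand, e.g. a slot ref or a literal
  final Op op;
  final String rhs;
  final boolean lhsIsSlot; // does the slot ref sit on the left already?

  Pred(String lhs, Op op, String rhs, boolean lhsIsSlot) {
    this.lhs = lhs; this.op = op; this.rhs = rhs; this.lhsIsSlot = lhsIsSlot;
  }

  // Normalize so the slot reference ends up on the left-hand side.
  Pred normalize() {
    if (lhsIsSlot) return this;
    return new Pred(rhs, op.converse(), lhs, true);
  }

  @Override public String toString() { return lhs + " " + op + " " + rhs; }
}
```

With this shape, `new Pred("16", Op.GE, "c_nationkey", false).normalize()` yields `c_nationkey LE 16`, which is the slot-on-the-left form that downstream code (min/max conjunct building, bound-slot extraction) can consume uniformly.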
Line 343: {
"{" on preceding line.

Line 658: msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt());
Check that it's not invalid.

http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
File fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java:

Line 35: public class NormalizeBinaryPredicatesRule implements ExprRewriteRule {
Add tests for the rule.

Line 40: if (!expr.isAnalyzed()) return expr;
Why do you need this?

http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

Line 1047: statistics predicates: c_nationkey <= 16, c_nationkey >= 16
I think it's better to retain the original predicate (you don't need to show me here that you know how to transform it correctly against the min/max tuple). Also, this should only be printed for verbose explain output.

--
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-HasComments: Yes
