Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
......................................................................


Patch Set 21:

(5 comments)

I haven't analyzed how exactly the overlap of min-max filters works..so will go 
through the design doc and maybe the original commit.  In the meantime, sending 
some initial comments.

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/catalog-op-executor.cc
File be/src/exec/catalog-op-executor.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/catalog-op-executor.cc@316
PS19, Line 316: or non-string
              :   // columns.
Update this since min/max stats are not getting computed for Decimal types.  
Also, are we sure that allowing float/double min-max filters is ok ? These are 
not precise data types.


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc@474
PS19, Line 474:     case PrimitiveType::TYPE_TIMESTAMP:
The commit message says the min-max column stats is used for integer, float and 
double so I was not expecting to see timestamp and date types also considered 
here.  Shouldn't they just to the default case and return false ?


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/incr-stats-util.cc
File be/src/exec/incr-stats-util.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/incr-stats-util.cc@260
PS19, Line 260: apache::hive::service::cli::thrift::TColumnValue
It would be good to create a typedef for the hive version of TColumnValue.  
Also, the package name looks different from what I see in the source code here: 
https://github.com/apache/hive/blob/master/service-rpc/src/gen/thrift/gen-javabean/org/apache/hive/service/rpc/thrift/TColumnValue.java#L7


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc@957
PS19, Line 957:         VLOG(3) << "The filter is set to always true.";
nit: Would be useful to print that this was a min-max filter and probably the 
filter id.


http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File 
testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@10
PS19, Line 10: CREATE TABLE unique_database.lineitem_orderkey_only(l_orderkey 
bigint)
Creating large tables during a test run adds unacceptable latency to the  
testing time.   Can the db and table creation steps not be moved into 
testdata/datasets/tpch/  as a schema template for min-max filters ?



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 15 Mar 2021 18:14:25 +0000
Gerrit-HasComments: Yes

Reply via email to