Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )
Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters ...................................................................... Patch Set 21: (5 comments) I haven't analyzed how exactly the overlap of min-max filters works..so will go through the design doc and maybe the original commit. In the meantime, sending some initial comments. http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/catalog-op-executor.cc File be/src/exec/catalog-op-executor.cc: http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/catalog-op-executor.cc@316 PS19, Line 316: or non-string : // columns. Update this since min/max stats are not getting computed for Decimal types. Also, are we sure that allowing float/double min-max filters is ok ? These are not precise data types. http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc File be/src/exec/filter-context.cc: http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc@474 PS19, Line 474: case PrimitiveType::TYPE_TIMESTAMP: The commit message says the min-max column stats is used for integer, float and double so I was not expecting to see timestamp and date types also considered here. Shouldn't they just to the default case and return false ? http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/incr-stats-util.cc File be/src/exec/incr-stats-util.cc: http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/incr-stats-util.cc@260 PS19, Line 260: apache::hive::service::cli::thrift::TColumnValue It would be good to create a typedef for the hive version of TColumnValue. Also, the package name looks different from what I see in the source code here: https://github.com/apache/hive/blob/master/service-rpc/src/gen/thrift/gen-javabean/org/apache/hive/service/rpc/thrift/TColumnValue.java#L7 http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc File be/src/exec/partitioned-hash-join-builder.cc: http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc@957 PS19, Line 957: VLOG(3) << "The filter is set to always true."; nit: Would be useful to print that this was a min-max filter and probably the filter id. http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test File testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test: http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@10 PS19, Line 10: CREATE TABLE unique_database.lineitem_orderkey_only(l_orderkey bigint) Creating large tables during a test run adds unacceptable latency to the testing time. Can the db and table creation steps not be moved into testdata/datasets/tpch/ as a schema template for min-max filters ? -- To view, visit http://gerrit.cloudera.org:8080/17075 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df Gerrit-Change-Number: 17075 Gerrit-PatchSet: 21 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 15 Mar 2021 18:14:25 +0000 Gerrit-HasComments: Yes
