Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 28:

(3 comments)

Rework.

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@260
PS27, Line 260:   }
> > Looks like mixture of files of different format (like Parquet and ORC at
Reworked method hasAtLeastOneParquetPartition() to iterate until a Parquet partition is met.

DONE.


http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@387
PS28, Line 387:   /*
> Can you remove this method.
Done


http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
File testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test:

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test@98
PS28, Line 98: # Create a new hudi parquet table.
> Please note that it creates a plain Parquet table from the data in the Hudi
Good catch. Did a unit test with the hudi table. All seems good. I think it probably is okay to leave out the hudi test for now.

Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build e45818b616ab4ff81fa512c8257bc8fac594094a)
COMPUTE_COLUMN_MINMAX_STATS set to true
Query: compute stats functional_parquet.hudi_non_partitioned
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 1 partition(s) and 15 column(s). |
+------------------------------------------+
Fetched 1 row(s) in 0.53s

[12:57:17 qchen@qifan-10229: IMPALA-10494_making_use_of_minmax_column_stats] sql dml.showstats.hudi
Starting Impala Shell with no authentication using Python 2.7.16
Warning: live_progress only applies to interactive shell sessions, and is being skipped for now.
Opened TCP connection to localhost:21000
Connected to localhost:21000
Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build e45818b616ab4ff81fa512c8257bc8fac594094a)
SHOW_COLUMN_MINMAX_STATS set to true
Query: show column stats functional_parquet.hudi_non_partitioned
+------------------------+---------------------------------------+------------------+--------+----------+-------------------+--------+---------+----------------------+--------------------+
| Column                 | Type                                  | #Distinct Values | #Nulls | Max Size | Avg Size          | #Trues | #Falses | Min                  | Max                |
+------------------------+---------------------------------------+------------------+--------+----------+-------------------+--------+---------+----------------------+--------------------+
| _hoodie_commit_time    | STRING                                | 2                | 0      | 14       | 14                | -1     | -1      | -1                   | -1                 |
| _hoodie_commit_seqno   | STRING                                | 97               | 0      | 20       | 20                | -1     | -1      | -1                   | -1                 |
| _hoodie_record_key     | STRING                                | 99               | 0      | 36       | 36                | -1     | -1      | -1                   | -1                 |
| _hoodie_partition_path | STRING                                | 3                | 0      | 25       | 25                | -1     | -1      | -1                   | -1                 |
| _hoodie_file_name      | STRING                                | 6                | 0      | 71       | 70.68000030517578 | -1     | -1      | -1                   | -1                 |
| _hoodie_is_deleted     | BOOLEAN                               | 2                | 0      | 1        | 1                 | 0      | 100     | -1                   | -1                 |
| _row_key               | STRING                                | 99               | 0      | 36       | 36                | -1     | -1      | -1                   | -1                 |
| begin_lat              | DOUBLE                                | 100              | 0      | 8        | 8                 | -1     | -1      | 0.013803214965246391 | 0.9973157077943435 |
| begin_lon              | DOUBLE                                | 99               | 0      | 8        | 8                 | -1     | -1      | 0.014143391676368022 | 0.991562254763212  |
| driver                 | STRING                                | 2                | 0      | 10       | 10                | -1     | -1      | -1                   | -1                 |
| end_lat                | DOUBLE                                | 100              | 0      | 8        | 8                 | -1     | -1      | 7.903052288528167E-4 | 0.9877514097604384 |
| end_lon                | DOUBLE                                | 99               | 0      | 8        | 8                 | -1     | -1      | 0.029829569706356973 | 0.9978872086544781 |
| fare                   | STRUCT<amount:DOUBLE,currency:STRING> | -1               | -1     | -1       | -1                | -1     | -1      | -1                   | -1                 |
| partition              | STRING                                | 3                | 0      | 25       | 25                | -1     | -1      | -1                   | -1                 |
| rider                  | STRING                                | 2                | 0      | 9        | 9                 | -1     | -1      | -1                   | -1                 |
| timestamp              | DOUBLE                                | 1                | 0      | 8        | 8                 | -1     | -1      | 0.0                  | 0.0                |
+------------------------+---------------------------------------+------------------+--------+----------+-------------------+--------+---------+----------------------+--------------------+
Fetched 16 row(s) in 0.02s

| Max Per-Host Resource Reservation: Memory=6.97MB Threads=5                                                         |
| Per-Host Resource Estimates: Memory=55MB                                                                           |
| Codegen disabled by planner                                                                                        |
| Analyzed query: SELECT /* +straight_join */ a.`_hoodie_record_key` FROM                                            |
| functional_parquet.hudi_non_partitioned a,                                                                         |
| functional_parquet.hudi_non_partitioned b WHERE a.begin_lat = b.end_lat                                            |
|                                                                                                                    |
| F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                                                              |
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1                                |
|   PLAN-ROOT SINK                                                                                                   |
|   |  output exprs: a.`_hoodie_record_key`                                                                          |
|   |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0                           |
|   |                                                                                                                |
|   04:EXCHANGE [UNPARTITIONED]                                                                                      |
|      mem-estimate=23.18KB mem-reservation=0B thread-reservation=0                                                  |
|      tuple-ids=0,1 row-size=64B cardinality=100                                                                    |
|      in pipelines: 00(GETNEXT)                                                                                     |
|                                                                                                                    |
| F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3                                                                     |
| Per-Host Resources: mem-estimate=34.95MB mem-reservation=2.95MB thread-reservation=2 runtime-filters-memory=1.00MB |
|   DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=04, UNPARTITIONED]                                                       |
|   |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                                       |
|   02:HASH JOIN [INNER JOIN, BROADCAST]                                                                             |
|   |  hash predicates: a.begin_lat = b.end_lat                                                                      |
|   |  fk/pk conjuncts: a.begin_lat = b.end_lat                                                                      |
|   |  runtime filters: RF000[bloom] <- b.end_lat, RF001[min_max] <- b.end_lat                                       |
|   |  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB thread-reservation=0                          |
|   |  tuple-ids=0,1 row-size=64B cardinality=100                                                                    |
|   |  in pipelines: 00(GETNEXT), 01(OPEN)                                                                           |
|   |                                                                                                                |
|   |--03:EXCHANGE [BROADCAST]                                                                                       |
|   |     mem-estimate=16.00KB mem-reservation=0B thread-reservation=0                                               |
|   |     tuple-ids=1 row-size=8B cardinality=100                                                                    |
|   |     in pipelines: 01(GETNEXT)                                                                                  |
|   |                                                                                                                |
|   00:SCAN HDFS [functional_parquet.hudi_non_partitioned a, RANDOM]                                                 |
|      HDFS partitions=1/1 files=3 size=28.45KB                                                                      |
|      runtime filters: RF001[min_max] -> a.begin_lat, RF000[bloom] -> a.begin_lat                                   |
|      stored statistics:                                                                                            |
|        table: rows=100 size=28.45KB                                                                                |
|        columns: all                                                                                                |
|      extrapolated-rows=disabled max-scan-range-rows=34                                                             |
|      file formats: [PARQUET]                                                                                       |
|      mem-estimate=32.00MB mem-reservation=16.00KB thread-reservation=1                                             |
|      tuple-ids=0 row-size=56B cardinality=100                                                                      |
|      in pipelines: 00(GETNEXT)                                                                                     |
|                                                                                                                    |
| F01:PLAN FRAGMENT [RANDOM] hosts=3 instances=3                                                                     |
| Per-Host Resources: mem-estimate=16.00MB mem-reservation=16.00KB thread-reservation=2                              |
|   DATASTREAM SINK [FRAGMENT=F00, EXCHANGE=03, BROADCAST]                                                           |
|   |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                                       |
|   01:SCAN HDFS [functional_parquet.hudi_non_partitioned b, RANDOM]                                                 |
|      HDFS partitions=1/1 files=3 size=28.45KB                                                                      |
|      stored statistics:                                                                                            |
|        table: rows=100 size=28.45KB                                                                                |
|        columns: all                                                                                                |
|      extrapolated-rows=disabled max-scan-range-rows=34                                                             |
|      file formats: [PARQUET]                                                                                       |
|      mem-estimate=16.00MB mem-reservation=16.00KB thread-reservation=1                                             |
|      tuple-ids=1 row-size=8B cardinality=100                                                                       |
|      in pipelines: 01(GETNEXT)                                                                                     |
+--------------------------------------------------------------------------------------------------------------------+
Fetched 61 row(s) in 0.02s

Query: select straight_join a._hoodie_record_key from hudi_non_partitioned a, hudi_non_partitioned b where a.begin_lat = b.end_lat
Query submitted at: 2021-04-01 13:06:55 (Coordinator: http://qifan-10229:25000)
Query progress can be monitored at: http://qifan-10229:25000/query_plan?query_id=5d4afb9f44b47432:2a456b9d00000000
Fetched 0 row(s) in 0.11s


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 01 Apr 2021 17:20:57 +0000
Gerrit-HasComments: Yes
