[
https://issues.apache.org/jira/browse/IMPALA-10084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177209#comment-17177209
]
Sahil Takiar commented on IMPALA-10084:
---------------------------------------
While testing this, we might want to consider how to correctly use the
PlannerTestOption / ResultFilters. There are various options to consider:
DO_NOT_VALIDATE_ROWCOUNT_ESTIMATION_FOR_PARTITIONS and
DISABLE_HDFS_NUM_ROWS_ESTIMATE.
> Display the number of estimated rows for a table
> ------------------------------------------------
>
> Key: IMPALA-10084
> URL: https://issues.apache.org/jira/browse/IMPALA-10084
> Project: IMPALA
> Issue Type: Sub-task
> Reporter: Sahil Takiar
> Priority: Major
>
> AFAICT, there is no way to determine the number of rows estimated for a table
> when row counts have been estimated via file size:
> {code:java}
> [localhost:21000] default> create table test (col int);
> [localhost:21000] default> insert into table test values (1), (2), (3), (4),
> (5);
> [localhost:21000] default> show table stats test;
> +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------+
> | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |
> Incremental stats | Location |
> +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------+
> | -1 | 1 | 10B | NOT CACHED | NOT CACHED | TEXT | false
> | hdfs://localhost:20500/test-warehouse/test |
> +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------+
> [localhost:21000] default> explain select * from test order by col limit 10;
> +------------------------------------------------------------------------------------+
> | Explain String
> |
> +------------------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=8.00KB Threads=3
> |
> | Per-Host Resource Estimates: Memory=32MB
> |
> | WARNING: The following tables are missing relevant table and/or column
> statistics. |
> | default.test
> |
> |
> |
> | PLAN-ROOT SINK
> |
> | |
> |
> | 02:MERGING-EXCHANGE [UNPARTITIONED]
> |
> | | order by: col ASC
> |
> | | limit: 10
> |
> | |
> |
> | 01:TOP-N [LIMIT=10]
> |
> | | order by: col ASC
> |
> | | row-size=4B cardinality=3
> |
> | |
> |
> | 00:SCAN HDFS [default.test]
> |
> | HDFS partitions=1/1 files=1 size=10B
> |
> | row-size=4B cardinality=3
> |
> +------------------------------------------------------------------------------------+
> [localhost:21000] default> set explain_level=3;
> localhost:21000] default> explain select * from test order by col limit 10;
> +--------------------------------------------------------------------------------------+
> | Explain String
> |
> +--------------------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=8.00KB Threads=3
> |
> | Per-Host Resource Estimates: Memory=32MB
> |
> | WARNING: The following tables are missing relevant table and/or column
> statistics. |
> | default.test
> |
> | Analyzed query: SELECT * FROM `default`.test ORDER BY col ASC LIMIT CAST(10
> AS |
> | TINYINT)
> |
> |
> |
> | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |
> | Per-Host Resources: mem-estimate=16.00KB mem-reservation=0B
> thread-reservation=1 |
> | PLAN-ROOT SINK
> |
> | | output exprs: col
> |
> | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | |
> |
> | 02:MERGING-EXCHANGE [UNPARTITIONED]
> |
> | order by: col ASC
> |
> | limit: 10
> |
> | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
> |
> | tuple-ids=1 row-size=4B cardinality=3
> |
> | in pipelines: 01(GETNEXT)
> |
> |
> |
> | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> |
> | Per-Host Resources: mem-estimate=32.00MB mem-reservation=8.00KB
> thread-reservation=2 |
> | DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=02, UNPARTITIONED]
> |
> | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | 01:TOP-N [LIMIT=10]
> |
> | | order by: col ASC
> |
> | | mem-estimate=12B mem-reservation=0B thread-reservation=0
> |
> | | tuple-ids=1 row-size=4B cardinality=3
> |
> | | in pipelines: 01(GETNEXT), 00(OPEN)
> |
> | |
> |
> | 00:SCAN HDFS [default.test, RANDOM]
> |
> | HDFS partitions=1/1 files=1 size=10B
> |
> | stored statistics:
> |
> | table: rows=unavailable size=unavailable
> |
> | columns: unavailable
> |
> | extrapolated-rows=disabled max-scan-range-rows=unavailable
> |
> | mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1
> |
> | tuple-ids=0 row-size=4B cardinality=3
> |
> | in pipelines: 00(GETNEXT)
> |
> +--------------------------------------------------------------------------------------+
> [localhost:21000] default> create table part_test (col int) partitioned by
> (partcol int);
> [localhost:21000] default> alter table part_test add partition (partcol=1);
> [localhost:21000] default> insert into table part_test partition (partcol=1)
> values (1), (2), (3), (4), (5);
> [localhost:21000] default> show table stats part_test;
> +---------+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+
> | partcol | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format
> | Incremental stats | Location
> |
> +---------+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+
> | 1 | -1 | 1 | 10B | NOT CACHED | NOT CACHED | TEXT
> | false |
> hdfs://localhost:20500/test-warehouse/part_test/partcol=1 |
> | Total | -1 | 1 | 10B | 0B | |
> | |
> |
> +---------+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+
> [localhost:21000] default> show partitions part_test;
> +---------+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+
> | partcol | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format
> | Incremental stats | Location
> |
> +---------+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+
> | 1 | -1 | 1 | 10B | NOT CACHED | NOT CACHED | TEXT
> | false |
> hdfs://localhost:20500/test-warehouse/part_test/partcol=1 |
> | Total | -1 | 1 | 10B | 0B | |
> | |
> |
> +---------+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+
> [localhost:21000] default> explain select * from part_test order by col limit
> 10;
> +--------------------------------------------------------------------------------------+
> | Explain String
> |
> +--------------------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=8.00KB Threads=3
> |
> | Per-Host Resource Estimates: Memory=32MB
> |
> | WARNING: The following tables are missing relevant table and/or column
> statistics. |
> | default.part_test
> |
> | Analyzed query: SELECT * FROM `default`.part_test ORDER BY col ASC LIMIT
> CAST(10 |
> | AS TINYINT)
> |
> |
> |
> | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |
> | Per-Host Resources: mem-estimate=16.00KB mem-reservation=0B
> thread-reservation=1 |
> | PLAN-ROOT SINK
> |
> | | output exprs: col, partcol
> |
> | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | |
> |
> | 02:MERGING-EXCHANGE [UNPARTITIONED]
> |
> | order by: col ASC
> |
> | limit: 10
> |
> | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
> |
> | tuple-ids=1 row-size=8B cardinality=1
> |
> | in pipelines: 01(GETNEXT)
> |
> |
> |
> | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> |
> | Per-Host Resources: mem-estimate=32.00MB mem-reservation=8.00KB
> thread-reservation=2 |
> | DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=02, UNPARTITIONED]
> |
> | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | 01:TOP-N [LIMIT=10]
> |
> | | order by: col ASC
> |
> | | mem-estimate=8B mem-reservation=0B thread-reservation=0
> |
> | | tuple-ids=1 row-size=8B cardinality=1
> |
> | | in pipelines: 01(GETNEXT), 00(OPEN)
> |
> | |
> |
> | 00:SCAN HDFS [default.part_test, RANDOM]
> |
> | HDFS partitions=1/1 files=1 size=10B
> |
> | stored statistics:
> |
> | table: rows=unavailable size=unavailable
> |
> | partitions: 0/1 rows=1
> |
> | columns missing stats: col
> |
> | extrapolated-rows=disabled max-scan-range-rows=unavailable
> |
> | mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1
> |
> | tuple-ids=0 row-size=8B cardinality=1
> |
> | in pipelines: 00(GETNEXT)
> |
> +--------------------------------------------------------------------------------------+
> {code}
> It would be nice if there was a field in the explain plan similar to
> "extrapolated-rows" that lists the estimated row count for the table.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]