[ 
https://issues.apache.org/jira/browse/IMPALA-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132670#comment-17132670
 ] 

Qifan Chen commented on IMPALA-9744:
------------------------------------

Query plan with NJ into the table.
{code:java}
Query: explain select * from alltypes_insert_only_1
where int_col > (select min(1) from alltypes_insert_only_1)
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=96.00KB Threads=5 |
| Per-Host Resource Estimates: Memory=197MB |
| WARNING: The following tables have potentially corrupt table statistics. |
| Drop and re-compute statistics to resolve this problem. |
| default.alltypes_insert_only_1 |
| WARNING: The following tables are missing relevant table and/or column 
statistics. |
| default.alltypes_insert_only_1 |
| |
| PLAN-ROOT SINK |
| | |
| 03:NESTED LOOP JOIN [INNER JOIN, BROADCAST] |
| | predicates: int_col > min(1) |
| | row-size=81B cardinality=0 |
| | |
| |--06:EXCHANGE [UNPARTITIONED] |
| | | |
| | 00:SCAN HDFS [default.alltypes_insert_only_1] |
| | HDFS partitions=1/1 files=1 size=7.80KB |
| | row-size=80B cardinality=0 |
| | |
| 05:AGGREGATE [FINALIZE] |
| | output: min:merge(1) |
| | row-size=1B cardinality=1 |
| | |
| 04:EXCHANGE [UNPARTITIONED] |
| | |
| 02:AGGREGATE |
| | output: min(1) |
| | row-size=1B cardinality=1 |
| | |
| 01:SCAN HDFS [default.alltypes_insert_only_1] |
| HDFS partitions=1/1 files=1 size=7.80KB |
| partition key scan |
| row-size=0B cardinality=0 |{code}

+------------------------------------------------------------------------------------+

> Treat corrupt table stats as missing to avoid bad plans
> -------------------------------------------------------
>
>                 Key: IMPALA-9744
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9744
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Tim Armstrong
>            Assignee: Qifan Chen
>            Priority: Major
>              Labels: ramp-up
>
> We currently detect corrupt stats (0 rows but data in partition) but only 
> flag it. The 0 row count is used for planning. I ran into a scenario where 
> this lead to an extremely pathological plan - the 0 row count lead to 
> flipping a nested loop join to put the big table on the build side and 
> running out of memory.
> I propose doing something very conservative to avoid this scenario: if we see 
> corrupt stats in any partition, and the row count is computed to be zero, 
> ignore the row count and treat it the same as missing stats in the planner.
> Here's an example where we end up with corrupt stats. Warning: this can 
> remove the data file from your alltypes type, I recommend copying the file to 
> a different location before running this.
> {noformat}
> # In beeline against HS2
> !connect jdbc:hive2://localhost:11050 hive org.apache.hive.jdbc.HiveDrive
> set hive.stats.autogather=true;
> CREATE TABLE `alltypes_insert_only`(
>    `id` int COMMENT 'Add a comment',
>    `bool_col` boolean,
>    `tinyint_col` tinyint,
>    `smallint_col` smallint,
>    `int_col` int,
>    `bigint_col` bigint,
>    `float_col` float,
>    `double_col` double,
>    `date_string_col` string,
>    `string_col` string,
>    `timestamp_col` timestamp)
>  PARTITIONED BY (
>    `year` int,
>    `month` int)
>  STORED AS PARQUET
>  TBLPROPERTIES ("transactional"="true", 
> "transactional_properties"="insert_only");
> load data inpath 
> 'hdfs://172.19.0.1:20500/test-warehouse/alltypes_parquet/year=2009/month=1/154473eafa08ea0e-f9d70e7100000004_1040780996_data.0.parq'
>  into table alltypes_insert_only partition (year=2009,month=9);
> # In Impala
> show table stats alltypes_insert_only;
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> | year  | month | #Rows | #Files | Size   | Bytes Cached | Cache Replication 
> | Format  | Incremental stats | Location                                      
>                                          |
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> | 2009  | 10    | 0     | 1      | 7.75KB | NOT CACHED   | NOT CACHED        
> | PARQUET | false             | 
> hdfs://172.19.0.1:20500/test-warehouse/managed/alltypes_insert_only/year=2009/month=10
>  |
> | Total |       | -1    | 1      | 7.75KB | 0B           |                   
> |         |                   |                                               
>                                          |
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to