[ 
https://issues.apache.org/jira/browse/IMPALA-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-14757:
-------------------------------------
    Labels: ramp-up  (was: )

> Analytic functions' mem usage can be underestimated
> ---------------------------------------------------
>
>                 Key: IMPALA-14757
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14757
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>              Labels: ramp-up
>
> set num_nodes=1;
> with s as (select l_shipdate, l_orderkey, max(l_orderkey) over() maxkey
> from tpch_parquet.lineitem) select * from s where maxkey = l_orderkey;
> summary:
> {code}
> +--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
> | Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                |
> +--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
> | F00:ROOT     | 1      | 1     | 22.67us  | 22.67us  |       |            | 4.02 MB   | 21.75 MB      |                       |
> | 02:SELECT    | 1      | 1     | 24.14ms  | 24.14ms  | 2     | 600.12K    | 24.00 KB  | 0 B           |                       |
> | 01:ANALYTIC  | 1      | 1     | 723.74ms | 723.74ms | 6.00M | 6.00M      | 178.09 MB | 4.00 MB       |                       |
> | 00:SCAN HDFS | 1      | 1     | 10.94ms  | 10.94ms  | 6.00M | 6.00M      | 29.22 MB  | 160.00 MB     | tpch_parquet.lineitem |
> +--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
> {code}
> The analytic node consumes 178.09 MB at peak, versus the estimated 4.00 MB.
> Note that the analytic node can spill, so if it actually hit the mem_limit
> it would start spilling instead of increasing its memory usage further.
> Another issue is that the result cardinality is heavily overestimated
> (2 actual rows vs. an estimated 600.12K). The planner should realize that
> maxkey has a single value for all rows and estimate selectivity based on
> the NDV of the column.
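> As a rough cross-check (a sketch, not how the planner computes it): the
> 600.12K estimate equals 10% of the 6.00M input rows, whereas an NDV-based
> estimate for an equality against a single value would be roughly
> num_rows / NDV. The query below computes that ratio with Impala's ndv()
> aggregate; the column aliases are illustrative only.
> {code:sql}
> -- Approximate rows per distinct l_orderkey, i.e. an NDV-based estimate
> -- for a predicate of the form "l_orderkey = <single value>".
> select count(*) as num_rows,
>        ndv(l_orderkey) as orderkey_ndv,
>        count(*) / ndv(l_orderkey) as est_rows_per_key
> from tpch_parquet.lineitem;
> {code}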
> Note that this query would be more memory efficient if it were rewritten to
> use a scalar subquery to get the max (at the cost of reading the table
> twice) or to use ORDER BY.
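> A minimal sketch of the scalar-subquery rewrite, assuming the same three
> output columns are wanted (its plan and memory usage have not been
> verified as part of this report):
> {code:sql}
> -- Reads lineitem twice: once to compute the max, once to fetch the
> -- matching rows. No analytic node is needed.
> select l_shipdate, l_orderkey, l_orderkey as maxkey
> from tpch_parquet.lineitem
> where l_orderkey = (select max(l_orderkey) from tpch_parquet.lineitem);
> {code}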



