[jira] [Updated] (IMPALA-14757) Analytic functions' mem usage can be underestimated

Csaba Ringhofer (Jira) Thu, 19 Feb 2026 10:17:12 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Csaba Ringhofer updated IMPALA-14757:
-------------------------------------
    Description: 
set num_nodes=1;
with s as (
  select l_shipdate, l_orderkey, max(l_orderkey) over() maxkey
  from tpch_parquet.lineitem
)
select * from s where maxkey = l_orderkey;

summary:
{code}
+--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                |
+--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| F00:ROOT     | 1      | 1     | 22.67us  | 22.67us  |       |            | 4.02 MB   | 21.75 MB      |                       |
| 02:SELECT    | 1      | 1     | 24.14ms  | 24.14ms  | 2     | 600.12K    | 24.00 KB  | 0 B           |                       |
| 01:ANALYTIC  | 1      | 1     | 723.74ms | 723.74ms | 6.00M | 6.00M      | 178.09 MB | 4.00 MB       |                       |
| 00:SCAN HDFS | 1      | 1     | 10.94ms  | 10.94ms  | 6.00M | 6.00M      | 29.22 MB  | 160.00 MB     | tpch_parquet.lineitem |
+--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
{code}

The analytic node consumed 178.09 MB at peak vs. the estimated 4.00 MB.

Another issue is that the result cardinality is heavily overestimated (2 actual
rows vs. 600.12K estimated). The planner should realize that maxkey has a
single value for all rows and estimate selectivity based on the NDV of the
column.
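
As a rough check of the numbers in the summary (a sketch, not planner code):
the 600.12K estimate for 02:SELECT looks consistent with a flat 10% selectivity
applied to the roughly 6.00M scanned rows. The 10% figure and the distinct-key
count of 1.5M for l_orderkey (TPC-H scale factor 1) are assumptions, not values
reported in this issue.

```python
# Figures marked "from summary" are copied from the table above.
# Figures marked "assumed" are illustrative and not reported in this issue.
scan_rows = 6_000_000        # from summary: "6.00M" rows out of 00:SCAN HDFS (rounded)
actual_rows = 2              # from summary: rows returned by 02:SELECT
assumed_default_sel = 0.1    # assumed: flat fallback selectivity for the predicate
assumed_ndv = 1_500_000      # assumed: NDV(l_orderkey) at TPC-H scale factor 1

# Estimate with a flat selectivity; close to the reported 600.12K.
est_flat = scan_rows * assumed_default_sel

# Estimate if the predicate were treated as "column = single constant",
# i.e. selectivity 1 / NDV.
est_ndv = scan_rows / assumed_ndv

# Peak memory of 01:ANALYTIC relative to its estimate (both in MB, from summary).
mem_ratio = 178.09 / 4.00

print(f"flat-selectivity estimate: {est_flat:,.0f} rows")
print(f"NDV-based estimate:        {est_ndv:,.0f} rows (actual: {actual_rows})")
print(f"analytic peak mem is {mem_ratio:.1f}x the estimate")
```

Under these assumptions an NDV-based estimate would land within a small factor
of the actual row count, instead of several orders of magnitude above it.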

Note that this query would be more memory-efficient if it were rewritten to use
a scalar subquery to get the max (at the cost of reading the table twice) or to
use ORDER BY.
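
A minimal sketch of the scalar-subquery rewrite, checked for result equivalence
on a tiny in-memory table. It runs the SQL through Python's sqlite3 module
(which needs SQLite 3.25+ for window functions), so it only illustrates that
the two forms return the same rows; it says nothing about Impala's memory use
or plan. The sample rows and the unqualified table name are made up for the
example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_shipdate TEXT, l_orderkey INTEGER)")
# Hypothetical sample data; the max key (7) appears twice on purpose.
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?)",
    [("1996-01-02", 1), ("1996-02-03", 2), ("1996-03-04", 7), ("1996-04-05", 7)],
)

# Original shape: analytic max() over the whole table, then filter.
window_sql = """
WITH s AS (
  SELECT l_shipdate, l_orderkey, max(l_orderkey) OVER () AS maxkey
  FROM lineitem
)
SELECT l_shipdate, l_orderkey, maxkey FROM s
WHERE maxkey = l_orderkey
ORDER BY l_shipdate
"""

# Rewrite: compute the max once in a scalar subquery (second pass over the
# table). For surviving rows maxkey equals l_orderkey, so it is projected
# directly from that column.
scalar_sql = """
SELECT l_shipdate, l_orderkey, l_orderkey AS maxkey
FROM lineitem
WHERE l_orderkey = (SELECT max(l_orderkey) FROM lineitem)
ORDER BY l_shipdate
"""

window_rows = conn.execute(window_sql).fetchall()
scalar_rows = conn.execute(scalar_sql).fetchall()
print(window_rows == scalar_rows, scalar_rows)
```

The rewrite avoids buffering every input row while the window aggregate is
computed, which is the behaviour the 01:ANALYTIC peak memory above reflects.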

  was:
set num_nodes=1;
with s as (
  select l_shipdate, l_orderkey, max(l_orderkey) over() maxkey
  from tpch_parquet.lineitem
)
select * from s where maxkey = l_orderkey;

summary:
{code}
+--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                |
+--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| F00:ROOT     | 1      | 1     | 22.67us  | 22.67us  |       |            | 4.02 MB   | 21.75 MB      |                       |
| 02:SELECT    | 1      | 1     | 24.14ms  | 24.14ms  | 2     | 600.12K    | 24.00 KB  | 0 B           |                       |
| 01:ANALYTIC  | 1      | 1     | 723.74ms | 723.74ms | 6.00M | 6.00M      | 178.09 MB | 4.00 MB       |                       |
| 00:SCAN HDFS | 1      | 1     | 10.94ms  | 10.94ms  | 6.00M | 6.00M      | 29.22 MB  | 160.00 MB     | tpch_parquet.lineitem |
+--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
{code}

The analytic node consumed 178.09 MB at peak vs. the estimated 4.00 MB.

Another issue is that the result cardinality is heavily overestimated. The
planner should realize that maxkey has a single value for all rows and
estimate selectivity based on the NDV of the column.

Note that this query would be more memory-efficient if it were rewritten to use
a scalar subquery to get the max (at the cost of reading the table twice) or to
use ORDER BY.


> Analytic functions' mem usage can be underestimated
> ---------------------------------------------------
>
>                 Key: IMPALA-14757
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14757
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>
> set num_nodes=1;
> with s as (
>   select l_shipdate, l_orderkey, max(l_orderkey) over() maxkey
>   from tpch_parquet.lineitem
> )
> select * from s where maxkey = l_orderkey;
> summary:
> {code}
> +--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
> | Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                |
> +--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
> | F00:ROOT     | 1      | 1     | 22.67us  | 22.67us  |       |            | 4.02 MB   | 21.75 MB      |                       |
> | 02:SELECT    | 1      | 1     | 24.14ms  | 24.14ms  | 2     | 600.12K    | 24.00 KB  | 0 B           |                       |
> | 01:ANALYTIC  | 1      | 1     | 723.74ms | 723.74ms | 6.00M | 6.00M      | 178.09 MB | 4.00 MB       |                       |
> | 00:SCAN HDFS | 1      | 1     | 10.94ms  | 10.94ms  | 6.00M | 6.00M      | 29.22 MB  | 160.00 MB     | tpch_parquet.lineitem |
> +--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
> {code}
> The analytic node consumed 178.09 MB at peak vs. the estimated 4.00 MB.
> Another issue is that the result cardinality is heavily overestimated (2
> actual rows vs. 600.12K estimated). The planner should realize that maxkey
> has a single value for all rows and estimate selectivity based on the NDV of
> the column.
> Note that this query would be more memory-efficient if it were rewritten to
> use a scalar subquery to get the max (at the cost of reading the table twice)
> or to use ORDER BY.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

