[
https://issues.apache.org/jira/browse/IMPALA-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Csaba Ringhofer updated IMPALA-14757:
-------------------------------------
Description:
set num_nodes=1;
with s as (select l_shipdate, l_orderkey, max(l_orderkey) over() maxkey from
tpch_parquet.lineitem) select * from s where maxkey = l_orderkey;
summary:
{code}
+--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                |
+--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| F00:ROOT     | 1      | 1     | 22.67us  | 22.67us  |       |            | 4.02 MB   | 21.75 MB      |                       |
| 02:SELECT    | 1      | 1     | 24.14ms  | 24.14ms  | 2     | 600.12K    | 24.00 KB  | 0 B           |                       |
| 01:ANALYTIC  | 1      | 1     | 723.74ms | 723.74ms | 6.00M | 6.00M      | 178.09 MB | 4.00 MB       |                       |
| 00:SCAN HDFS | 1      | 1     | 10.94ms  | 10.94ms  | 6.00M | 6.00M      | 29.22 MB  | 160.00 MB     | tpch_parquet.lineitem |
+--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
{code}
The analytic node consumed 178MB vs the estimated 4MB.
Another issue is that the number of result rows is heavily overestimated (2 vs
600.12K): the planner should realize that maxkey will have a single value for
all rows and estimate the selectivity based on the NDV of the compared column
(l_orderkey).
Note that this query would be more memory efficient if it were rewritten to use
a scalar subquery to get the max (at the cost of reading the table twice) or to
use ORDER BY.
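A minimal sketch of the scalar subquery rewrite mentioned above, assuming the same table and columns as the original query. It has not been run or profiled as part of this report, so the actual memory usage of this form is not measured here:
{code:sql}
set num_nodes=1;
-- Compute the max once in an uncorrelated scalar subquery instead of
-- materializing it for every row with max(...) over().
-- Cost: tpch_parquet.lineitem is scanned twice.
select l_shipdate, l_orderkey, l_orderkey as maxkey
from tpch_parquet.lineitem
where l_orderkey = (select max(l_orderkey) from tpch_parquet.lineitem);
{code}
The maxkey alias is kept only so that the output columns match the original query. Every returned row has l_orderkey equal to the max.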
> Analytic functions' mem usage can be underestimated
> ---------------------------------------------------
>
> Key: IMPALA-14757
> URL: https://issues.apache.org/jira/browse/IMPALA-14757
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Csaba Ringhofer
> Priority: Major