[ https://issues.apache.org/jira/browse/IMPALA-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Csaba Ringhofer updated IMPALA-13052:
-------------------------------------
Description:
Sampling aggregates (sample, appx_median, histogram) return a string that can
be quite large, but the planner assumes it to have a fixed small size.
Examples:
select sample(l_orderkey) from tpch.lineitem;
  according to plan: row-size=12B
  in reality: TotalBytesSent: 254.45 KB (this is a single row sent by a host)
select appx_median(l_orderkey) from tpch.lineitem;
  according to plan: row-size=8B
  in reality: TotalBytesSent: 254.68 KB (this is a single row sent by a host)
select histogram(l_orderkey) from tpch.lineitem;
  according to plan: row-size=12B
  in reality: TotalBytesSent: 254.35 KB (this is a single row sent by a host)
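To put the planned and observed sizes side by side, here is a small Python sketch that computes how far off the planner's row-size is for each function. It uses only the figures quoted above. It assumes that KB means 1024 bytes and that TotalBytesSent is dominated by the single result row (it also includes some serialization overhead, so the factors are approximate). The name underestimate_factor is made up for this illustration and is not part of Impala.

```python
# Figures copied from the examples above:
# function name -> (planned row-size in bytes, TotalBytesSent in KB).
OBSERVED = {
    "sample": (12, 254.45),
    "appx_median": (8, 254.68),
    "histogram": (12, 254.35),
}

KB = 1024  # assumption: the profile reports binary kilobytes


def underestimate_factor(planned_bytes, sent_kb):
    """Return the bytes actually sent divided by the planner's row-size."""
    return sent_kb * KB / planned_bytes


for func, (planned, sent_kb) in OBSERVED.items():
    factor = underestimate_factor(planned, sent_kb)
    print(f"{func}: planned {planned} B, sent {sent_kb} KB, ~{factor:,.0f}x larger")
```

Under these assumptions, the single result row is roughly four orders of magnitude larger than the planned row-size in all three cases.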
This may also be relevant for the datasketches functions; I haven't checked those yet.
This can lead to severely underestimating the memory needs of grouping
aggregators:
select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 limit 1;
  04:AGGREGATE FINALIZE   Peak Mem: 2.19 GB  Est. Peak Mem: 18.00 MB
  01:AGGREGATE STREAMING  Peak Mem: 2.37 GB  Est. Peak Mem: 45.79 MB
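For a rough sense of scale, the following Python sketch computes the ratio of peak memory to estimated peak memory for the two aggregation nodes, using only the profile figures quoted above. It assumes binary units (1 GB = 1024 MB). The helper names to_bytes and peak_to_estimate_ratio are hypothetical and exist only for this illustration.

```python
# Assumption: the profile uses binary units.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

# Figures copied from the profile above: node -> (Peak Mem, Est. Peak Mem).
NODES = {
    "04:AGGREGATE FINALIZE": ("2.19 GB", "18.00 MB"),
    "01:AGGREGATE STREAMING": ("2.37 GB", "45.79 MB"),
}


def to_bytes(text):
    """Convert a string such as '2.19 GB' to a number of bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


def peak_to_estimate_ratio(peak, estimate):
    """Return how many times larger the observed peak is than the estimate."""
    return to_bytes(peak) / to_bytes(estimate)


for node, (peak, estimate) in NODES.items():
    ratio = peak_to_estimate_ratio(peak, estimate)
    print(f"{node}: peak is ~{ratio:.0f}x the estimate")
```

With these inputs, the finalize node uses about 125 times its estimate and the streaming node about 53 times its estimate.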
> Sampling aggregate result sizes are underestimated
> --------------------------------------------------
>
> Key: IMPALA-13052
> URL: https://issues.apache.org/jira/browse/IMPALA-13052
> Project: IMPALA
> Issue Type: Bug
> Reporter: Csaba Ringhofer
> Priority: Major