[jira] [Updated] (IMPALA-13052) Sampling aggregate result sizes are underestimated

Csaba Ringhofer (Jira) Thu, 02 May 2024 02:47:27 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Csaba Ringhofer updated IMPALA-13052:
-------------------------------------
    Description: 
Sampling aggregates (sample, appx_median, histogram) return a string that can 
be quite large, but the planner assumes it to have a fixed small size.

Examples:
select sample(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.45 KB (this is  single row sent by a host)

select appx_median(l_orderkey) from tpch.lineitem;
according to plan: row-size= 8B
in reality: TotalBytesSent: 254.68 KB (this is  single row sent by a host)

select histogram(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.35 KB (this is  single row sent by a host)

This may be also relevant for datasketches functions, haven't checked thos  yet.

This can lead to highly underestimating the memory needs of grouping 
aggregators:
select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 
limit 1
04:AGGREGATE  FINALIZE Peak Mem:  2.19 GB   Est. Peak Mem:  18.00 MB
01:AGGREGATE STREAMING  Peak Mem:   2.37 GB   Est. Peak Mem:  45.79 MB

  was:
Sampling aggregates (sample, appx_median, histogram) return a string that can 
be quite large, but the planner assumes it to have a fixed small size.

Examples:
select sample(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.45 KB (this is  single row sent by a host)

select appx_median(l_orderkey) from tpch.lineitem;
according to plan: row-size= 8B
in reality: TotalBytesSent: 254.68 KB (this is  single row sent by a host)

select histogram(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.35 KB (this is  single row sent by a host)

This may be also relevant for datasketches functions.



> Sampling aggregate result sizes are underestimated
> --------------------------------------------------
>
>                 Key: IMPALA-13052
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13052
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Csaba Ringhofer
>            Priority: Major
>
> Sampling aggregates (sample, appx_median, histogram) return a string that can 
> be quite large, but the planner assumes it to have a fixed small size.
> Examples:
> select sample(l_orderkey) from tpch.lineitem;
> according to plan: row-size=12B
> in reality: TotalBytesSent: 254.45 KB (this is  single row sent by a host)
> select appx_median(l_orderkey) from tpch.lineitem;
> according to plan: row-size= 8B
> in reality: TotalBytesSent: 254.68 KB (this is  single row sent by a host)
> select histogram(l_orderkey) from tpch.lineitem;
> according to plan: row-size=12B
> in reality: TotalBytesSent: 254.35 KB (this is  single row sent by a host)
> This may be also relevant for datasketches functions, haven't checked thos  
> yet.
> This can lead to highly underestimating the memory needs of grouping 
> aggregators:
> select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 
> limit 1
> 04:AGGREGATE  FINALIZE Peak Mem:  2.19 GB   Est. Peak Mem:  18.00 MB
> 01:AGGREGATE STREAMING  Peak Mem:   2.37 GB   Est. Peak Mem:  45.79 MB



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (IMPALA-13052) Sampling aggregate result sizes are underestimated

Reply via email to