[jira] [Updated] (IMPALA-10377) Improve the accuracy of resource estimation

Tim Armstrong (Jira) Fri, 04 Dec 2020 09:07:54 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Armstrong updated IMPALA-10377:
-----------------------------------
    Target Version: Impala 4.0

> Improve the accuracy of resource estimation
> -------------------------------------------
>
>                 Key: IMPALA-10377
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10377
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.4.0
>            Reporter: liuyao
>            Assignee: liuyao
>            Priority: Major
>              Labels: estimate, memory, statistics
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> PlanNode does not consider several factors when estimating memory, which
> causes large estimation errors.
>  
> AggregationNode
>  
> 1. The memory occupied by the hash table's own data structures is not
> considered. Each new value inserted into the hash table adds a bucket, and
> a bucket is 16 bytes.
> 2. When estimating the NDV of a merge aggregation with multiple grouping
> exprs, the NDV may be divided by the number of fragment instances several
> times. It should be divided only once.
> 3. For a merge aggregation with multiple grouping exprs, the estimated
> memory is much smaller than the actual usage.
> 4. If there are no grouping exprs, the estimated memory is much larger than
> the actual usage.
> 5. If the NDV of the grouping exprs is very small, the estimated memory is
> much larger than the actual usage.
>  
> SortNode
> 1. For an external sort, the estimated memory is much smaller than the
> actual usage.
>  
>  
> HashJoinNode
> 1. The memory occupied by the hash table's own data structures is not
> considered. The hash table keeps duplicate entries, so the size of
> DuplicateNode should be considered.
> 2. The hash table creates multiple buckets in advance. The size of these
> buckets should be considered.
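
The two HashJoinNode points could be folded into an overhead term along these lines. This is a sketch under stated assumptions, not Impala's implementation: the function and parameter names are hypothetical, the 16-byte bucket default is borrowed from the AggregationNode note, and the initial bucket count and DuplicateNode size are left as inputs because the description does not give them.

```python
def join_hash_table_overhead_bytes(
    build_rows: int,
    build_ndv: int,
    initial_buckets: int,
    duplicate_node_bytes: int,
    bucket_bytes: int = 16,
) -> int:
    """Bytes used by the hash table structure itself, excluding row data."""
    # Buckets created in advance are paid for even if few keys arrive.
    buckets = max(initial_buckets, build_ndv)
    # Simplifying assumption: every build row beyond the first row for a
    # given key needs one DuplicateNode.
    duplicate_nodes = max(0, build_rows - build_ndv)
    return buckets * bucket_bytes + duplicate_nodes * duplicate_node_bytes


# Placeholder inputs chosen only to exercise the function.
print(join_hash_table_overhead_bytes(1000, 100, 1024, 24))
```

The point of the sketch is that both terms are independent of row width, so they matter most when build rows are narrow or heavily duplicated.
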
>  
> KuduScanNode
> 1. Memory is estimated as if all columns were scanned, so the estimated
> memory is much larger than the actual usage.
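
For the KuduScanNode point, a small sketch of the difference between sizing a row from every table column and sizing it from only the columns the scan reads. The column names, byte widths, and the helper function are made up for illustration and do not come from Impala or Kudu.

```python
from typing import Dict, Iterable


def row_bytes(column_bytes: Dict[str, int], columns: Iterable[str]) -> int:
    """Sum the estimated width of the given columns."""
    return sum(column_bytes[name] for name in columns)


# Hypothetical table: average bytes per value for each column.
table = {"id": 8, "name": 64, "payload": 1024}

all_columns = row_bytes(table, table.keys())     # sized from every column
scanned_only = row_bytes(table, ["id", "name"])  # sized from scanned columns
print(all_columns, scanned_only)
```

Here the all-columns row width is 1096 bytes against 72 bytes for the two scanned columns, so any per-row buffer estimate built on the former is inflated by the same ratio.
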



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
