[ 
https://issues.apache.org/jira/browse/IMPALA-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300682#comment-17300682
 ] 

ASF subversion and git services commented on IMPALA-10377:
----------------------------------------------------------

Commit 1a01bfe831b548204cde8087def51a2f27b40cb4 in impala's branch 
refs/heads/master from liuyao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1a01bfe ]

IMPALA-10377: Improve the accuracy of resource estimation

PlanNode ignores several factors when estimating memory, which
causes large estimation errors.

AggregationNode
1. MemoryEstimate = Ndv * (AvgRowSize + SizeOfBucket)
2. When estimating the Ndv of a merge aggregation, the Ndv should be
   divided by the number of fragment instances only once.
3. If there are no grouping exprs, MemoryEstimate =
   MIN_PLAIN_AGG_MEM
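
The AggregationNode rules 1 and 3 above can be sketched as follows. This
is a minimal illustration, not the planner's actual code (the Impala
frontend is written in Java). The 16-byte bucket size comes from the
issue description; the MIN_PLAIN_AGG_MEM value and the function name are
placeholders assumed for the example.

```python
import math

SIZE_OF_BUCKET = 16            # bytes per hash-table bucket (from the issue description)
MIN_PLAIN_AGG_MEM = 32 * 1024  # placeholder value, assumed for illustration only


def estimate_agg_memory(ndv, avg_row_size, has_grouping_exprs,
                        min_plain_agg_mem=MIN_PLAIN_AGG_MEM):
    """Hypothetical helper: memory estimate in bytes for an aggregation."""
    if not has_grouping_exprs:
        # Rule 3: a plain aggregation keeps no hash table, so use the floor.
        return min_plain_agg_mem
    # Rule 1: one row plus one bucket per distinct grouping value.
    return math.ceil(ndv * (avg_row_size + SIZE_OF_BUCKET))


grouped = estimate_agg_memory(ndv=1000, avg_row_size=24.0, has_grouping_exprs=True)
plain = estimate_agg_memory(ndv=1000, avg_row_size=24.0, has_grouping_exprs=False)
print(grouped, plain)
```

With 1000 distinct values and 24-byte rows, the bucket overhead accounts
for 16,000 of the 40,000 estimated bytes, which is the part the old
estimate left out.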

SortNode
1. MemoryEstimate = Cardinality * AvgRowSize. This is the memory
   used when enough memory is available.
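
As a quick worked example of the SortNode formula above (the function
name and the input numbers are made up for illustration):

```python
import math


def estimate_sort_memory(cardinality, avg_row_size):
    """Hypothetical helper: bytes needed to hold every input row in memory."""
    return math.ceil(cardinality * avg_row_size)


# 2 million rows averaging 50 bytes each.
sort_bytes = estimate_sort_memory(cardinality=2_000_000, avg_row_size=50.0)
print(sort_bytes)
```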

HashJoinNode
1. MemoryEstimate = DataRows + Buckets + DuplicateNodes, where
   DataRows = RightTableCardinality * AvgRowSize,
   Buckets = roundUpToPowerOf2(RightTableCardinality) *
             SizeOfBucket,
   DuplicateNodes = (RightTableCardinality - RightNdv) *
                    SizeOfDuplicateNode
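
The three HashJoinNode terms above can be sketched like this. It is an
illustrative model rather than the planner's actual Java code. The
16-byte bucket size is taken from the issue description (assuming the
join uses the same hash table layout as the aggregation); the duplicate
node size and all function names are assumptions.

```python
import math

SIZE_OF_BUCKET = 16          # bytes, from the issue description
SIZE_OF_DUPLICATE_NODE = 16  # bytes, placeholder assumed for illustration


def round_up_to_power_of_2(n):
    """Smallest power of two that is >= n (1 for n <= 1)."""
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()


def estimate_hash_join_memory(right_cardinality, right_ndv, avg_row_size):
    """Hypothetical helper: build-side memory estimate in bytes."""
    data_rows = math.ceil(right_cardinality * avg_row_size)
    # Buckets are pre-allocated, so their count is rounded up to a power of two.
    buckets = round_up_to_power_of_2(right_cardinality) * SIZE_OF_BUCKET
    # Every row beyond the first for a given key needs a duplicate node.
    duplicate_nodes = max(0, right_cardinality - right_ndv) * SIZE_OF_DUPLICATE_NODE
    return data_rows + buckets + duplicate_nodes


join_bytes = estimate_hash_join_memory(right_cardinality=1000, right_ndv=800,
                                       avg_row_size=20.0)
print(join_bytes)
```

For this input the row data is 20,000 bytes, the 1024 buckets add
16,384 bytes, and the 200 duplicate rows add 3,200 bytes.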

KuduScanNode
1. MemoryEstimate = Columns * BytesPerColumn * MaxScannerThreads,
   where Columns counts only the columns scanned by the query, not
   all the columns of the table.
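
To show why counting only the scanned columns matters, here is a small
sketch of the KuduScanNode formula above. The per-column byte figure,
the thread count, and the function name are placeholders assumed for
the example, not values taken from Impala.

```python
def estimate_kudu_scan_memory(num_scanned_columns, bytes_per_column,
                              max_scanner_threads):
    """Hypothetical helper: scan memory estimate in bytes."""
    return num_scanned_columns * bytes_per_column * max_scanner_threads


BYTES_PER_COLUMN = 384 * 1024  # placeholder, assumed for illustration
MAX_SCANNER_THREADS = 4        # placeholder, assumed for illustration

# A query that reads 3 columns of a 20-column table.
scanned_only = estimate_kudu_scan_memory(3, BYTES_PER_COLUMN, MAX_SCANNER_THREADS)
all_columns = estimate_kudu_scan_memory(20, BYTES_PER_COLUMN, MAX_SCANNER_THREADS)
print(scanned_only, all_columns)
```

Under these assumed numbers, estimating from all 20 columns would be
more than six times larger than estimating from the 3 columns the
query actually reads.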

UnitTest
1. Add test cases to CardinalityTest that cover memory estimation.
   Modify existing test cases related to memory estimation.

Change-Id: Ic01db168ff2c6d6de33ee553a8175599f035d7a1
Reviewed-on: http://gerrit.cloudera.org:8080/16842
Reviewed-by: Zoltan Borok-Nagy <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Improve the accuracy of resource estimation
> -------------------------------------------
>
>                 Key: IMPALA-10377
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10377
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.4.0
>            Reporter: liuyao
>            Assignee: liuyao
>            Priority: Major
>              Labels: estimate, memory, statistics
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> PlanNode ignores several factors when estimating memory, which
> causes large estimation errors.
>  
> AggregationNode
>  
> 1. The memory occupied by the hash table's own data structure is not
> considered. Inserting a new value into the hash table adds a bucket.
> The size of a bucket is 16 bytes.
> 2. When estimating the NDV of a merge aggregation with multiple
> grouping exprs, the NDV may be divided by the number of fragment
> instances several times; it should be divided only once.
> 3. When estimating the NDV of a merge aggregation with multiple
> grouping exprs, the estimated memory is much smaller than the actual
> usage.
> 4. If there are no grouping exprs, the estimated memory is much larger
> than the actual usage.
> 5. If the NDV of the grouping exprs is very small, the estimated memory
> is much larger than the actual usage.
>  
> SortNode
> 1. The estimate reflects the memory usage of an external sort, so the
> estimated memory is much smaller than the actual usage.
>  
>  
> HashJoinNode
> 1. The memory occupied by the hash table's own data structure is not
> considered. The hash table keeps duplicate data, so the size of
> DuplicateNode should be considered.
> 2. The hash table creates multiple buckets in advance. The size of
> these buckets should be considered.
>  
> KuduScanNode
> 1. Memory is estimated as if all columns were scanned, so the estimated
> memory is much larger than the actual usage.


