[ https://issues.apache.org/jira/browse/IMPALA-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300682#comment-17300682 ]
ASF subversion and git services commented on IMPALA-10377:
----------------------------------------------------------
Commit 1a01bfe831b548204cde8087def51a2f27b40cb4 in impala's branch
refs/heads/master from liuyao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1a01bfe ]
IMPALA-10377: Improve the accuracy of resource estimation
PlanNode ignores several factors when estimating memory, which
leads to large estimation errors.
AggregationNode
1. MemoryEstimate = Ndv * (AvgRowSize + SizeOfBucket)
2. When estimating the Ndv of a merge aggregation, the Ndv should be
divided by the number of fragment instances only once.
3. If there are no grouping exprs, MemoryEstimate =
MIN_PLAIN_AGG_MEM
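
The three AggregationNode rules above can be sketched roughly as follows.
This is an illustration only, not the planner code: the function names and
the min_plain_agg_mem parameter are hypothetical, the 16-byte bucket size
is the figure quoted in the issue description, and combining the NDVs of
several grouping exprs by multiplication is an assumption made for the
example.

```python
import math

BUCKET_SIZE_BYTES = 16  # bucket size quoted in the issue description


def merge_agg_ndv_per_instance(grouping_ndvs, num_instances):
    """Per-instance NDV for a merge aggregation, dividing only once.

    Assumption: the combined NDV is the product of the per-expression NDVs.
    """
    combined_ndv = math.prod(grouping_ndvs)
    return combined_ndv / num_instances


def estimate_agg_memory(ndv, avg_row_size, has_grouping_exprs, min_plain_agg_mem):
    """MemoryEstimate = Ndv * (AvgRowSize + SizeOfBucket), in bytes.

    min_plain_agg_mem is a caller-supplied stand-in for MIN_PLAIN_AGG_MEM,
    whose actual value is not stated in this message.
    """
    if not has_grouping_exprs:
        return min_plain_agg_mem
    return ndv * (avg_row_size + BUCKET_SIZE_BYTES)
```

With NDVs of 100 and 50 spread over 10 instances, dividing once gives 500
per instance. Dividing each expr's NDV by 10 before multiplying would give
only 50, which is the kind of underestimate rule 2 is meant to avoid.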
SortNode
1. MemoryEstimate = Cardinality * AvgRowSize. This is the memory
used when enough memory is available.
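
As a small worked example of the SortNode formula, the sketch below
multiplies the input cardinality by the average row size. The function
name is hypothetical and the numbers are made up for illustration.

```python
def estimate_sort_memory(cardinality, avg_row_size):
    """MemoryEstimate = Cardinality * AvgRowSize, in bytes.

    This is the memory needed to hold every input row at once.
    """
    return cardinality * avg_row_size
```

For one million rows averaging 50 bytes each, this gives 50,000,000 bytes.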
HashJoinNode
1. MemoryEstimate = DataRows + Buckets + DuplicateNodes, where
DataRows = RightTableCardinality * AvgRowSize,
Buckets = roundUpToPowerOf2(RightTableCardinality) * SizeOfBucket,
DuplicateNodes = (RightTableCardinality - RightNdv) *
SizeOfDuplicateNode
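
The HashJoinNode formula can be sketched as below, under stated
assumptions. The function names are hypothetical, bucket_size and
duplicate_node_size are caller-supplied because their values for the join
hash table are not given in this message, and treating inputs of 0 or 1 as
rounding up to 1 is a choice made for the example.

```python
def round_up_to_power_of_2(n):
    """Smallest power of two >= n (returns 1 for n <= 1, by assumption)."""
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()


def estimate_hash_join_memory(right_cardinality, right_ndv, avg_row_size,
                              bucket_size, duplicate_node_size):
    """MemoryEstimate = DataRows + Buckets + DuplicateNodes, in bytes."""
    # Rows of the right (build) side stored in the hash table.
    data_rows = right_cardinality * avg_row_size
    # Buckets are allocated in advance, rounded up to a power of two.
    buckets = round_up_to_power_of_2(right_cardinality) * bucket_size
    # Each row beyond the first for a given key needs a duplicate node.
    # Clamped at zero in case the NDV estimate exceeds the cardinality.
    duplicates = max(0, right_cardinality - right_ndv) * duplicate_node_size
    return data_rows + buckets + duplicates
```

For example, 1000 build rows with an NDV of 600, 20-byte rows, and 16 bytes
for both structure sizes gives 20000 + 16384 + 6400 = 42784 bytes.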
KuduScanNode
1. MemoryEstimate = Columns * BytesPerColumn * MaxScannerThreads,
where Columns counts only the columns scanned by the query, not all
the columns of the table.
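
To show why counting only the scanned columns matters, here is a minimal
sketch of the KuduScanNode formula. The function name is hypothetical and
the per-column byte count and thread count are illustrative inputs, not
values taken from Impala.

```python
def estimate_kudu_scan_memory(num_scanned_columns, bytes_per_column,
                              max_scanner_threads):
    """MemoryEstimate = Columns * BytesPerColumn * MaxScannerThreads, in bytes.

    num_scanned_columns counts only the columns the query reads.
    """
    return num_scanned_columns * bytes_per_column * max_scanner_threads
```

If a query reads 3 columns of a 50-column table, the estimate is 3/50 of
what it would be if every column were counted.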
UnitTest
1. CardinalityTest adds test cases for memory estimation, and
existing test cases related to memory estimation are modified.
Change-Id: Ic01db168ff2c6d6de33ee553a8175599f035d7a1
Reviewed-on: http://gerrit.cloudera.org:8080/16842
Reviewed-by: Zoltan Borok-Nagy <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Improve the accuracy of resource estimation
> -------------------------------------------
>
> Key: IMPALA-10377
> URL: https://issues.apache.org/jira/browse/IMPALA-10377
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 3.4.0
> Reporter: liuyao
> Assignee: liuyao
> Priority: Major
> Labels: estimate, memory, statistics
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> PlanNode ignores several factors when estimating memory, which leads to
> large estimation errors.
>
> AggregationNode
>
> 1. The memory occupied by the hash table's own data structure is not
> considered. Inserting a new value into the hash table adds a bucket, and
> each bucket is 16 bytes.
> 2. When estimating the NDV of a merge aggregation with multiple grouping
> exprs, the NDV may be divided by the number of fragment instances several
> times; it should be divided only once.
> 3. When estimating the NDV of a merge aggregation with multiple grouping
> exprs, the estimated memory is much smaller than the actual usage.
> 4. If there are no grouping exprs, the estimated memory is much larger than
> the actual usage.
> 5. If the NDV of the grouping exprs is very small, the estimated memory is
> much larger than the actual usage.
>
> SortNode
> 1. The estimate is based on the memory usage of an external sort, so the
> estimated memory is much smaller than the actual usage.
>
>
> HashJoinNode
> 1. The memory occupied by the hash table's own data structure is not
> considered. The hash table keeps duplicate data, so the size of
> DuplicateNode should be considered.
> 2. The hash table creates multiple buckets in advance. The size of these
> buckets should be considered.
>
> KuduScanNode
> 1. Memory is estimated as if all columns were scanned, so the estimated
> memory is much larger than the actual usage.