[GitHub] [spark] ulysses-you commented on a diff in pull request #40915: [SPARK-43232][SQL] Improve ObjectHashAggregateExec performance for high cardinality

via GitHub Mon, 24 Apr 2023 05:20:26 -0700


ulysses-you commented on code in PR #40915:
URL: https://github.com/apache/spark/pull/40915#discussion_r1175202831



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala:
##########
@@ -252,6 +249,7 @@ class SortBasedAggregator(
       var hasNextAggBuffer: Boolean = initialAggBufferIterator.next()
       private var result: AggregationBufferEntry = _
       private var groupingKey: UnsafeRow = _
+      private var aggregateMode: Int = _

Review Comment:
   Sort-based aggregation has no such code. The difference is that object 
hash aggregation has two iterators: one over the aggregation buffers that were 
generated before falling back to sort-based aggregation, and one over the 
input rows. After falling back to sort-based aggregation in partial mode, we 
should do an update for the input rows and a merge for the buffered entries.
   
   This variable is used to avoid unnecessary grouping key comparisons.
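
   To illustrate the update-versus-merge distinction described above, here is a 
minimal, self-contained sketch. It is not the code in this PR and does not use 
Spark's classes: `FallbackSketch`, `AvgBuffer`, `InputRow`, `Buffered`, and 
`aggregate` are all hypothetical names, and the aggregate (a running sum and 
count) is an assumption chosen only to keep the example small.

```scala
// Hypothetical, simplified model; none of these names come from Spark.
object FallbackSketch {
  // Aggregation buffer for a simple average: running sum and count.
  final case class AvgBuffer(sum: Long, count: Long)

  // The two kinds of entries the sort-based path sees after fallback.
  sealed trait Entry { def key: String }
  // A raw input row that has not been aggregated yet.
  final case class InputRow(key: String, value: Long) extends Entry
  // A buffer that was already partially aggregated before the fallback.
  final case class Buffered(key: String, buffer: AvgBuffer) extends Entry

  // update: fold one raw value into a buffer.
  def update(acc: AvgBuffer, value: Long): AvgBuffer =
    AvgBuffer(acc.sum + value, acc.count + 1)

  // merge: combine two partially aggregated buffers.
  def merge(acc: AvgBuffer, other: AvgBuffer): AvgBuffer =
    AvgBuffer(acc.sum + other.sum, acc.count + other.count)

  // Sort both sources by grouping key, then apply update to raw rows
  // and merge to pre-aggregated buffers.
  def aggregate(buffered: Seq[Buffered], rows: Seq[InputRow]): Map[String, AvgBuffer] = {
    val all: Seq[Entry] = buffered ++ rows
    val sorted = all.sortBy(_.key)
    sorted.foldLeft(Map.empty[String, AvgBuffer]) { (result, entry) =>
      val acc = result.getOrElse(entry.key, AvgBuffer(0L, 0L))
      val next = entry match {
        case InputRow(_, v) => update(acc, v)
        case Buffered(_, b) => merge(acc, b)
      }
      result.updated(entry.key, next)
    }
  }
}
```

   In this sketch, a buffered entry carries an intermediate state rather than a 
raw value, so it has to go through `merge`, while raw rows go through `update`. 
That is the reason the two iterators need different handling after the fallback.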



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


