Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/9383#discussion_r43785456
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala
---
@@ -473,24 +473,28 @@ class TungstenAggregationIterator(
// Part 3: Methods and fields used by hash-based aggregation.
///////////////////////////////////////////////////////////////////////////
+ private def createHashMap(): UnsafeFixedWidthAggregationMap = {
+ new UnsafeFixedWidthAggregationMap(
+ initialAggregationBuffer,
+
StructType.fromAttributes(allAggregateFunctions.flatMap(_.aggBufferAttributes)),
+ StructType.fromAttributes(groupingExpressions.map(_.toAttribute)),
+ TaskContext.get().taskMemoryManager(),
+ 1024 * 16, // initial capacity
+ TaskContext.get().taskMemoryManager().pageSizeBytes,
+ false // disable tracking of performance metrics
+ )
+ }
+
// This is the hash map used for hash-based aggregation. It is backed by
an
// UnsafeFixedWidthAggregationMap and it is used to store
// all groups and their corresponding aggregation buffers for hash-based
aggregation.
- private[this] val hashMap = new UnsafeFixedWidthAggregationMap(
- initialAggregationBuffer,
-
StructType.fromAttributes(allAggregateFunctions.flatMap(_.aggBufferAttributes)),
- StructType.fromAttributes(groupingExpressions.map(_.toAttribute)),
- TaskContext.get().taskMemoryManager(),
- 1024 * 16, // initial capacity
- TaskContext.get().taskMemoryManager().pageSizeBytes,
- false // disable tracking of performance metrics
- )
+ private[this] var hashMap = createHashMap()
// The function used to read and process input rows. When processing
input rows,
// it first uses hash-based aggregation by putting groups and their
buffers in
// hashMap. If we could not allocate more memory for the map, we switch
to
// sort-based aggregation (by calling switchToSortBasedAggregation).
--- End diff --
Update comment to reflect the fact that we use multiple hash-maps, spilling
after each becomes full then using sort to merge the spills?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]