[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

JoshRosen Tue, 03 Nov 2015 10:23:46 -0800

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43785456
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala
 ---
    @@ -473,24 +473,28 @@ class TungstenAggregationIterator(
       // Part 3: Methods and fields used by hash-based aggregation.
       
///////////////////////////////////////////////////////////////////////////
     
    +  private def createHashMap(): UnsafeFixedWidthAggregationMap = {
    +    new UnsafeFixedWidthAggregationMap(
    +      initialAggregationBuffer,
    +      
StructType.fromAttributes(allAggregateFunctions.flatMap(_.aggBufferAttributes)),
    +      StructType.fromAttributes(groupingExpressions.map(_.toAttribute)),
    +      TaskContext.get().taskMemoryManager(),
    +      1024 * 16, // initial capacity
    +      TaskContext.get().taskMemoryManager().pageSizeBytes,
    +      false // disable tracking of performance metrics
    +    )
    +  }
    +
       // This is the hash map used for hash-based aggregation. It is backed by 
an
       // UnsafeFixedWidthAggregationMap and it is used to store
       // all groups and their corresponding aggregation buffers for hash-based 
aggregation.
    -  private[this] val hashMap = new UnsafeFixedWidthAggregationMap(
    -    initialAggregationBuffer,
    -    
StructType.fromAttributes(allAggregateFunctions.flatMap(_.aggBufferAttributes)),
    -    StructType.fromAttributes(groupingExpressions.map(_.toAttribute)),
    -    TaskContext.get().taskMemoryManager(),
    -    1024 * 16, // initial capacity
    -    TaskContext.get().taskMemoryManager().pageSizeBytes,
    -    false // disable tracking of performance metrics
    -  )
    +  private[this] var hashMap = createHashMap()
     
       // The function used to read and process input rows. When processing 
input rows,
       // it first uses hash-based aggregation by putting groups and their 
buffers in
       // hashMap. If we could not allocate more memory for the map, we switch 
to
       // sort-based aggregation (by calling switchToSortBasedAggregation).
    --- End diff --
    
    Update comment to reflect the fact that we use multiple hash-maps, spilling 
after each becomes full then using sort to merge the spills?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Reply via email to