[GitHub] [spark] erikerlandson commented on issue #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

GitBox Mon, 08 Jul 2019 11:29:00 -0700

URL: https://github.com/apache/spark/pull/25024#issuecomment-509339589
 
 
   @rxin the key difference is in the `update` methods. The standard UDAF 
requires that the aggregator be stored in a MutableAggregationBuffer and so a 
UDAF update method always has this basic form:
   ```scala
   def update(buf: MutableAggregationBuffer, input: Row): Unit = {
     // UDT deserializes the aggregator from 'buf'
     val agg = buf.getAs[AggregatorType](0)
     agg.update(input)    // update the state of your aggregation
     buf(0) = agg         // UDT re-serializes the aggregator back into buf
   }
   ```
   The consequence is that the actual aggregating structure gets deserialized 
and (re)serialized for every single input row. If your data frame has a million 
rows, it's doing ser/de on your aggregator a million times, not just at the end 
of each data partition.
   
   Compare that with the UDIA (which is driven by TypedImperativeAggregate):
   ```scala
   def update(agg: AggregatorType, input: Row): AggregatorType = {
     agg.update(input) // update the state of your aggregator from the input
     agg // return the aggregator
   }
   ```
   You can see that here there is no ser/de of the aggregator at all while 
processing input rows (which is as it should be).  The TypedImperativeAggregate 
only invokes ser/de on the aggregator when it is collecting partial results 
across partitions (and at the end, when it is presenting final results into the 
output data frame).
   
   So, imagine a data frame with 10 partitions and 1 million rows.  The UDAF 
does ser/de on the aggregator a million (plus 10) times, while the UDIA does 
ser/de only 10 times.
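   
   To make that count concrete, here is a rough toy model in plain Scala. It 
does not use any Spark classes: `SumAgg`, `Codec`, `udafStyle` and `udiaStyle` 
are hypothetical names made up for this sketch, and the "serialization" is just 
packing a Long into 8 bytes. It only imitates the two update patterns above and 
counts how many times the aggregator makes a ser/de round trip.
   ```scala
   // Toy model only: no Spark APIs; every name here is hypothetical.
   object SerDeCount {
     // Stand-in for a user aggregator: a running sum.
     final class SumAgg(var total: Long) {
       def update(x: Long): Unit = total += x
     }

     // Fake serializer that counts round trips (one per deserialize call).
     final class Codec {
       var roundTrips: Long = 0L
       def serialize(agg: SumAgg): Array[Byte] =
         java.nio.ByteBuffer.allocate(8).putLong(agg.total).array()
       def deserialize(bytes: Array[Byte]): SumAgg = {
         roundTrips += 1
         new SumAgg(java.nio.ByteBuffer.wrap(bytes).getLong)
       }
     }

     // UDAF pattern: the aggregator lives in serialized form between rows.
     def udafStyle(partitions: Seq[Seq[Long]], codec: Codec): Long = {
       val partials = partitions.map { rows =>
         var buf = codec.serialize(new SumAgg(0L))
         rows.foreach { x =>
           val agg = codec.deserialize(buf) // deserialize on every row
           agg.update(x)
           buf = codec.serialize(agg)       // re-serialize on every row
         }
         buf
       }
       // Merge step: one more round trip per partition.
       partials.map(b => codec.deserialize(b).total).sum
     }

     // UDIA pattern: the aggregator stays a live object while rows are processed.
     def udiaStyle(partitions: Seq[Seq[Long]], codec: Codec): Long = {
       val partials = partitions.map { rows =>
         val agg = new SumAgg(0L)
         rows.foreach(agg.update)           // no ser/de per row
         codec.serialize(agg)               // serialized once per partition
       }
       partials.map(b => codec.deserialize(b).total).sum
     }

     def main(args: Array[String]): Unit = {
       // 10 partitions of 100,000 rows each = 1 million rows in total.
       val partitions = Vector.fill(10)(Vector.fill(100000)(1L))
       val udaf = new Codec
       val udia = new Codec
       udafStyle(partitions, udaf)
       udiaStyle(partitions, udia)
       println(s"UDAF-style round trips: ${udaf.roundTrips}")
       println(s"UDIA-style round trips: ${udia.roundTrips}")
     }
   }
   ```
   Under these assumptions the UDAF-style loop reports 1000010 round trips and 
the UDIA-style loop reports 10, which matches the numbers above.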

