erikerlandson commented on a change in pull request #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#discussion_r329818090
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala
 ##########
 @@ -450,3 +452,165 @@ case class ScalaUDAF(
 
   override def nodeName: String = udaf.getClass.getSimpleName
 }
+
+/**
+ * The internal wrapper used to hook a [[UserDefinedImperativeAggregator]] `udia` in the
+ * internal aggregation code path.
+ */
+case class ScalaUDIA[T](
+    children: Seq[Expression],
+    udia: UserDefinedImperativeAggregator[T],
+    mutableAggBufferOffset: Int = 0,
+    inputAggBufferOffset: Int = 0)
+  extends TypedImperativeAggregate[T]
+  with NonSQLExpression
+  with UserDefinedExpression
+  with ImplicitCastInputTypes
+  with Logging {
+
+  def dataType: DataType = udia.resultType
+
+  val inputTypes: Seq[DataType] = udia.inputSchema.map(_.dataType)
 
 Review comment:
   Inputs are not assumed to be nullable; however, the logic around that is 
implicit in the `update` method: an implementation of `update` can either 
check for null inputs or not. It might be possible to define a 
nullable-update and a non-nullable-update, so that null checks can be skipped 
when the aggregation is over columns known never to contain null values. The 
nullable-update could default to a wrapper around the non-nullable-update 
that skips the row if any input is null, with the option to override. 
Encoders support this kind of dual functionality, for example.
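
A minimal sketch of the nullable-update / non-nullable-update split described above. It is not code from this PR: the trait `DualUpdateAggregator`, its methods `updateNonNull` and `updateNullable`, the example `SumFirstColumn`, and the helper `chooseUpdate` are all hypothetical names, and inputs are modeled as a plain `Seq[Any]` rather than Spark rows so the example stays self-contained.

```scala
// Hypothetical sketch: none of these names are part of Spark or of this PR.
object NullableUpdateSketch {

  /** Aggregator with a fast non-null update and a null-aware update (assumed API). */
  trait DualUpdateAggregator[A] {
    /** Fast path: the caller guarantees that no input value is null. */
    def updateNonNull(agg: A, inputs: Seq[Any]): A

    /** Null-aware path: by default, skip the row if any input is null. Overridable. */
    def updateNullable(agg: A, inputs: Seq[Any]): A =
      if (inputs.exists(_ == null)) agg else updateNonNull(agg, inputs)
  }

  /** Example aggregator: sums the first input column, read as a Double. */
  object SumFirstColumn extends DualUpdateAggregator[Double] {
    def updateNonNull(agg: Double, inputs: Seq[Any]): Double =
      agg + inputs.head.asInstanceOf[Double]
  }

  /** Pick the update path once, based on whether any input column may hold nulls. */
  def chooseUpdate[A](
      aggregator: DualUpdateAggregator[A],
      anyInputNullable: Boolean): (A, Seq[Any]) => A =
    if (anyInputNullable) (agg, in) => aggregator.updateNullable(agg, in)
    else (agg, in) => aggregator.updateNonNull(agg, in)

  /** Fold all rows through the chosen update function, starting from zero. */
  def sumRows(rows: Seq[Seq[Any]], anyInputNullable: Boolean): Double = {
    val update = chooseUpdate(SumFirstColumn, anyInputNullable)
    rows.foldLeft(0.0)(update)
  }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Seq[Any]] = Seq(Seq(1.0), Seq(null), Seq(2.5))
    // The null row is skipped by the default updateNullable wrapper.
    println(sumRows(rows, anyInputNullable = true)) // prints 3.5
  }
}
```

In a real integration the `anyInputNullable` flag could plausibly come from the `nullable` field of each `StructField` in the input schema, so the null check would be paid only when the schema allows nulls. Whether that wiring fits the aggregation code path here is an open question.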
   
   Based on my performance studies, my intuition is that checks for null 
inputs are unlikely to add significant overhead relative to all the other 
computation going on. However, if the common use case could skip the 
null-handling path, it may be worth allowing the option.
