Hi everyone,

I was playing with the integration of Hive UDAFs in Spark SQL and noticed that
the terminatePartial and merge methods of custom UDAFs were never called. This
caught my attention, since those two methods are exactly the ones responsible
for distributing UDAF execution in Hive.
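
For reference, the lifecycle I am talking about looks roughly like this when
written out as a minimal sum evaluator in Scala (illustrative only; a real
UDAF also needs the enclosing UDAF class and resolver plumbing):

import org.apache.hadoop.hive.ql.exec.UDAFEvaluator

// A classic Hive UDAF evaluator for a running sum. Only init() comes
// from the UDAFEvaluator interface; the remaining methods are located
// by Hive via reflection, so their names are part of the contract.
class SumEvaluator extends UDAFEvaluator {
  private var sum: Double = 0.0

  override def init(): Unit = { sum = 0.0 }

  // Map side: called once per input row of a partition.
  def iterate(value: java.lang.Double): Boolean = {
    if (value != null) sum += value
    true
  }

  // Map side: emits this partition's compact partial state.
  def terminatePartial(): java.lang.Double = sum

  // Reduce side: folds another partition's partial state into this one.
  def merge(partial: java.lang.Double): Boolean = {
    if (partial != null) sum += partial
    true
  }

  // Reduce side: produces the final result.
  def terminate(): java.lang.Double = sum
}
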
Looking at the code of HiveUdafFunction, which seems to be the wrapper for all
native Hive functions that have no Spark SQL-specific implementation, I noticed
that it

a) extends AggregateFunction and not PartialAggregate
b) only ever calls iterate and evaluate on the underlying UDAFEvaluator
object, never merge or terminatePartial (see the sketch below)
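
To make b) concrete, the control flow I see boils down to something like the
following. This is my paraphrase in simplified Scala, not the actual Spark
source; the names SimplifiedEvaluator and aggregateGroup are made up:

// My reading of HiveUdafFunction, heavily simplified: every row of a
// group is fed to iterate(), and evaluate() goes straight to the final
// answer. No partial state is ever produced or merged, so the whole
// group has to be aggregated in one place.
trait SimplifiedEvaluator {
  def iterate(row: Any): Unit
  def evaluate(): Any
}

def aggregateGroup(evaluator: SimplifiedEvaluator, rows: Iterator[Any]): Any = {
  rows.foreach(evaluator.iterate)  // all rows end up on one node
  evaluator.evaluate()             // final result, no terminatePartial/merge
}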

My question is thus twofold: is my observation correct that, to achieve
distributed execution of a UDAF, I have to add a custom implementation at the
Spark SQL layer (like the examples in aggregates.scala)? And if that is the
case, how difficult would it be to use the terminatePartial and merge methods
provided by the UDAFEvaluator to make Hive UDAFs distributed by default?
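
For concreteness, what I am hoping could happen by default is the two-phase
flow below, reusing the SumEvaluator sketched earlier. This is driver-side
pseudocode with a made-up distributedSum helper, not a proposal for the actual
physical-plan API:

// Phase 1 (map side): each partition aggregates locally and ships only
// its compact partial state. Phase 2 (reduce side): one evaluator folds
// the partials into the final result.
def distributedSum(partitions: Seq[Seq[java.lang.Double]]): java.lang.Double = {
  val partials = partitions.map { rows =>
    val ev = new SumEvaluator
    ev.init()
    rows.foreach(ev.iterate)
    ev.terminatePartial()
  }
  val combiner = new SumEvaluator
  combiner.init()
  partials.foreach(combiner.merge)
  combiner.terminate()
}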


Cheers,

Daniel


