At a high level, some candidate strategies are:

1. "Fix" the logic in ScalaUDAF (possibly in conjunction with modifications to the UDAF trait itself) so that the update method can do the right thing; the per-row cost it currently pays is illustrated in the sketch just below this list.
2. Expose TypedImperativeAggregate to users for defining their own aggregators, since it already does the right thing.
3. As a workaround, allow users to define their own sub-classes of DataType. That would essentially allow one to define the sqlType of the UDT to be the aggregating object itself and make ser/de a no-op. I tried this and it compiles, but Spark's internals only recognize a predefined universe of DataType classes.
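To make the cost concrete, here is a minimal sketch of the pattern in question. The Sketch, SketchUDT, and SketchUDAF names are hypothetical stand-ins, not taken from the JIRA, and since UserDefinedType is not a public API in current Spark the example assumes the file is compiled under an org.apache.spark sub-package to reach the internal class. The point is the update() method: under the current ScalaUDAF plumbing, reading the UDT field out of the buffer row deserializes it and writing it back serializes it again, once per input row.

    // Assumes compilation inside an org.apache.spark sub-package, because
    // UserDefinedType is not exposed publicly in Spark 2.x.
    package org.apache.spark.example.sketch

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Hypothetical aggregating object -- a stand-in for any sketch-like structure.
    class Sketch extends Serializable {
      def add(x: Double): Sketch = this        // placeholder update logic
      def merge(that: Sketch): Sketch = this   // placeholder merge logic
    }

    // Hypothetical UDT backing Sketch with a binary sqlType.
    class SketchUDT extends UserDefinedType[Sketch] {
      def sqlType: DataType = BinaryType
      def serialize(obj: Sketch): Any = Array.empty[Byte]  // real impl: encode the sketch
      def deserialize(datum: Any): Sketch = new Sketch     // real impl: decode the sketch
      def userClass: Class[Sketch] = classOf[Sketch]
    }

    class SketchUDAF extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("x", DoubleType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sketch", new SketchUDT) :: Nil)
      def dataType: DataType = new SketchUDT
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = new Sketch
      }

      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        // Reading the UDT field converts it back to a Sketch (deserialize) ...
        val sk = buffer.getAs[Sketch](0)
        // ... and writing it back converts it to its sqlType again (serialize),
        // so both conversions happen for every input row.
        buffer(0) = sk.add(input.getDouble(0))
      }

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        // The same round trip happens here, but per partial buffer rather than per row.
        buffer1(0) = buffer1.getAs[Sketch](0).merge(buffer2.getAs[Sketch](0))
      }

      def evaluate(buffer: Row): Any = buffer.getAs[Sketch](0)
    }

The TypedImperativeAggregate path (option 2) avoids this by keeping the buffer as a plain JVM object and only invoking serialize/deserialize at shuffle boundaries, which is why exposing it looks attractive.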
All of these options are likely to have implications for the Catalyst internals; I'm not sure whether they are minor or more substantial.

On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin <r...@databricks.com> wrote:

> Yes this is known and an issue for performance. Do you have any thoughts
> on how to fix this?
>
> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson <eerla...@redhat.com>
> wrote:
>
>> I describe some of the details here:
>> https://issues.apache.org/jira/browse/SPARK-27296
>>
>> The short version of the story is that aggregating data structures (UDTs)
>> used by UDAFs are serialized to a Row object, and de-serialized, for every
>> row in a data frame.
>> Cheers,
>> Erik
>>