They are unfortunately all pretty substantial (which is why this problem exists) ...
On Wed, Mar 27, 2019 at 4:36 PM, Erik Erlandson < eerla...@redhat.com > wrote:

> At a high level, some candidate strategies are:
>
> 1. "Fix" the logic in ScalaUDAF (possibly in conjunction with mods to the UDAF
> trait itself) so that the update method can do the right thing.
>
> 2. Expose TypedImperativeAggregate to users for defining their own, since
> it already does the right thing.
>
> 3. As a workaround, allow users to define their own sub-classes of
> DataType. It would essentially allow one to define the sqlType of the UDT
> to be the aggregating object itself and make ser/de a no-op. I tried
> doing this and it will compile, but Spark's internals only consider a
> predefined universe of DataType classes.
>
> All of these options are likely to have implications for the Catalyst
> systems. I'm not sure whether they are minor or more substantial.
>
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@databricks.com > wrote:
>
>> Yes, this is known and an issue for performance. Do you have any thoughts
>> on how to fix this?
>>
>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson < eerlands@redhat.com > wrote:
>>
>>> I describe some of the details here:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-27296
>>>
>>> The short version of the story is that aggregating data structures (UDTs)
>>> used by UDAFs are serialized to a Row object, and de-serialized, for every
>>> row in a data frame.
>>>
>>> Cheers,
>>> Erik
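
As a concrete point of reference for option 2 above: the public Aggregator API already has the shape being asked for, in that reduce and merge operate on a plain Scala buffer object and buffer serialization is expressed once, declaratively, through an Encoder rather than through a Row-backed MutableAggregationBuffer. Below is a minimal sketch of a typed average; the names AvgBuffer and TypedAverage are illustrative only, and this is not a claim that the current Aggregator execution path avoids the per-row conversion described in SPARK-27296.

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    // Illustrative buffer: a plain Scala object, not a Row-backed buffer.
    case class AvgBuffer(var sum: Double, var count: Long)

    object TypedAverage extends Aggregator[Double, AvgBuffer, Double] {
      // Initial buffer value.
      def zero: AvgBuffer = AvgBuffer(0.0, 0L)

      // Called once per input value; works directly on the typed buffer.
      def reduce(b: AvgBuffer, x: Double): AvgBuffer = {
        b.sum += x
        b.count += 1
        b
      }

      // Combines two partial buffers (e.g. from different partitions).
      def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
        b1.sum += b2.sum
        b1.count += b2.count
        b1
      }

      // Produces the final result from the buffer.
      def finish(b: AvgBuffer): Double =
        if (b.count == 0L) 0.0 else b.sum / b.count

      // Buffer/output serialization is described via Encoders, not hand-written
      // conversions inside update().
      def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

It can be applied to a Dataset[Double] with ds.select(TypedAverage.toColumn). The open question in this thread is how to get the same kind of typed-buffer contract for untyped DataFrame aggregation, which is what the candidate strategies above are about.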