Not that I know of. We did do some work to speed it up in the
low-cardinality case: https://issues.apache.org/jira/browse/SPARK-17949

On Wed, Mar 27, 2019 at 4:40 PM, Erik Erlandson < eerla...@redhat.com > wrote:

> 
> BTW, if this is known, is there an existing JIRA I should link to?
> 
> 
> On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson < eerlands@redhat.com > wrote:
> 
> 
>> 
>> 
>> At a high level, some candidate strategies are:
>> 
>> 1. "Fix" the logic in ScalaUDAF (possibly in conjunction with mods to the
>> UDAF trait itself) so that the update method can do the right thing.
>> 
>> 2. Expose TypedImperativeAggregate to users for defining their own, since
>> it already does the right thing (a rough sketch of this follows below).
>> 
>> 3. As a workaround, allow users to define their own sub-classes of
>> DataType. This would essentially allow one to define the sqlType of the UDT
>> to be the aggregating object itself, making ser/de a no-op. I tried doing
>> this and it compiles, but Spark's internals only consider a predefined
>> universe of DataType classes.
>> 
>> 
>> All of these options are likely to have implications for the catalyst
>> systems. I'm not sure whether those would be minor or more substantial.
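>> 
>> To make option (2) concrete, here is a rough, hypothetical sketch of what
>> a user-defined TypedImperativeAggregate might look like if it were exposed.
>> The method signatures follow the internal catalyst API (as of 2.4), but the
>> names TypedSum and SumBuffer are mine, and this is illustrative only, not a
>> working patch:
>> 
>> import java.nio.ByteBuffer
>> 
>> import org.apache.spark.sql.catalyst.InternalRow
>> import org.apache.spark.sql.catalyst.expressions.Expression
>> import org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate
>> import org.apache.spark.sql.types.{DataType, DoubleType}
>> 
>> // The aggregation buffer is a plain JVM object, not a Row.
>> class SumBuffer(var sum: Double)
>> 
>> case class TypedSum(
>>     child: Expression,
>>     mutableAggBufferOffset: Int = 0,
>>     inputAggBufferOffset: Int = 0)
>>   extends TypedImperativeAggregate[SumBuffer] {
>> 
>>   def createAggregationBuffer(): SumBuffer = new SumBuffer(0.0)
>> 
>>   // update mutates the buffer object in place -- no per-row ser/de
>>   def update(buffer: SumBuffer, input: InternalRow): SumBuffer = {
>>     val v = child.eval(input)
>>     if (v != null) buffer.sum += v.asInstanceOf[Double]
>>     buffer
>>   }
>> 
>>   def merge(buffer: SumBuffer, other: SumBuffer): SumBuffer = {
>>     buffer.sum += other.sum
>>     buffer
>>   }
>> 
>>   def eval(buffer: SumBuffer): Any = buffer.sum
>> 
>>   // ser/de runs only at partial-aggregation / shuffle boundaries
>>   def serialize(buffer: SumBuffer): Array[Byte] =
>>     ByteBuffer.allocate(8).putDouble(buffer.sum).array()
>> 
>>   def deserialize(bytes: Array[Byte]): SumBuffer =
>>     new SumBuffer(ByteBuffer.wrap(bytes).getDouble)
>> 
>>   def withNewMutableAggBufferOffset(n: Int): TypedSum =
>>     copy(mutableAggBufferOffset = n)
>>   def withNewInputAggBufferOffset(n: Int): TypedSum =
>>     copy(inputAggBufferOffset = n)
>> 
>>   def children: Seq[Expression] = Seq(child)
>>   def nullable: Boolean = false
>>   def dataType: DataType = DoubleType
>> }
>> 
>> The key point is that update mutates a plain JVM object, and
>> serialize/deserialize run once per aggregation buffer rather than once per
>> input row.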
>> 
>> 
>> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@databricks.com > wrote:
>> 
>> 
>>> Yes, this is known and is an issue for performance. Do you have any thoughts
>>> on how to fix this?
>>> 
>>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson < eerlands@redhat.com > wrote:
>>> 
>>> 
>>>> I describe some of the details here:
>>>> 
>>>> https://issues.apache.org/jira/browse/SPARK-27296
>>>> 
>>>> 
>>>> 
>>>> The short version of the story is that the aggregating data structures
>>>> (UDTs) used by UDAFs are serialized to, and deserialized from, a Row
>>>> object for every row in a data frame.
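>>>> 
>>>> To illustrate the shape of the problem, here's a minimal sketch against
>>>> the public UDAF API (the class and its names are mine, purely for
>>>> illustration). With primitive buffer columns like these the per-row
>>>> conversion is cheap; the pain shows up when bufferSchema holds a UDT,
>>>> e.g. a sketch structure, which then gets fully deserialized and
>>>> re-serialized on every call to update:
>>>> 
>>>> import org.apache.spark.sql.Row
>>>> import org.apache.spark.sql.expressions.{MutableAggregationBuffer,
>>>>   UserDefinedAggregateFunction}
>>>> import org.apache.spark.sql.types._
>>>> 
>>>> class NaiveMean extends UserDefinedAggregateFunction {
>>>>   def inputSchema: StructType =
>>>>     StructType(StructField("x", DoubleType) :: Nil)
>>>> 
>>>>   // a UDT column here is where the per-row ser/de cost bites
>>>>   def bufferSchema: StructType = StructType(
>>>>     StructField("sum", DoubleType) :: StructField("n", LongType) :: Nil)
>>>> 
>>>>   def dataType: DataType = DoubleType
>>>>   def deterministic: Boolean = true
>>>> 
>>>>   def initialize(buffer: MutableAggregationBuffer): Unit = {
>>>>     buffer(0) = 0.0
>>>>     buffer(1) = 0L
>>>>   }
>>>> 
>>>>   // called once per input row, against a Row-backed buffer
>>>>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
>>>>     if (!input.isNullAt(0)) {
>>>>       buffer(0) = buffer.getDouble(0) + input.getDouble(0)
>>>>       buffer(1) = buffer.getLong(1) + 1L
>>>>     }
>>>>   }
>>>> 
>>>>   def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
>>>>     b1(0) = b1.getDouble(0) + b2.getDouble(0)
>>>>     b1(1) = b1.getLong(1) + b2.getLong(1)
>>>>   }
>>>> 
>>>>   def evaluate(buffer: Row): Any =
>>>>     if (buffer.getLong(1) == 0L) null
>>>>     else buffer.getDouble(0) / buffer.getLong(1)
>>>> }
>>>> 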
>>>> Cheers,
>>>> Erik
>>>> 
>>> 
>>> 
>> 
>> 
> 
>
