Not that I know of. We did do some work to make it faster in the low-cardinality case: https://issues.apache.org/jira/browse/SPARK-17949
On Wed, Mar 27, 2019 at 4:40 PM, Erik Erlandson <eerla...@redhat.com> wrote:

> BTW, if this is known, is there an existing JIRA I should link to?
>
> On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson <eerla...@redhat.com> wrote:
>
>> At a high level, some candidate strategies are:
>>
>> 1. "Fix" the logic in ScalaUDAF (possibly in conjunction with mods to the UDAF trait itself) so that the update method can do the right thing.
>> 2. Expose TypedImperativeAggregate to users for defining their own aggregates, since it already does the right thing.
>> 3. As a workaround, allow users to define their own sub-classes of DataType. That would essentially allow one to define the sqlType of the UDT to be the aggregating object itself and make ser/de a no-op. I tried doing this and it compiles, but Spark's internals only consider a predefined universe of DataType classes.
>>
>> All of these options are likely to have implications for the Catalyst systems. I'm not sure whether they are minor or more substantial.
>>
>> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Yes, this is known and an issue for performance. Do you have any thoughts on how to fix this?
>>>
>>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson <eerla...@redhat.com> wrote:
>>>
>>>> I describe some of the details here:
>>>> https://issues.apache.org/jira/browse/SPARK-27296
>>>>
>>>> The short version of the story is that the aggregating data structures (UDTs) used by UDAFs are serialized to a Row object, and de-serialized, for every row in a data frame.
>>>>
>>>> Cheers,
>>>> Erik
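To make the per-row ser/de concrete, here is a minimal sketch against the public UDAF API (the class name and element type are illustrative, not from the thread). The comments mark where the conversion Erik describes happens:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Toy UDAF that accumulates distinct Longs. The aggregation state
    // lives in a buffer Row as an ArrayType column, so ScalaUDAF
    // converts the whole array between the internal row format and
    // Scala objects around every update() call -- once per input row.
    class CollectDistinct extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("values", ArrayType(LongType)) :: Nil)
      def dataType: DataType = ArrayType(LongType)
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Seq.empty[Long]

      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) {
          // getSeq deserializes the full aggregation state; the
          // assignment below re-serializes it. Both happen per row.
          val seen = buffer.getSeq[Long](0)
          buffer(0) = (seen :+ input.getLong(0)).distinct
        }

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = (buffer1.getSeq[Long](0) ++ buffer2.getSeq[Long](0)).distinct

      def evaluate(buffer: Row): Any = buffer.getSeq[Long](0)
    }

For a cheap buffer like a single Long the conversion is negligible, but when the state is a large collection or a sketch structure it can dominate the cost of the aggregation.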
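For comparison, strategy 2 is roughly what the typed Aggregator API already does on the Dataset side: the reduction state stays a plain JVM object, and (as I understand the internals, it is backed by TypedImperativeAggregate when the buffer isn't a flat type) the bufferEncoder only runs when partial aggregates are serialized for shuffle and merge, not once per input row. A minimal sketch, with an illustrative name:

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    // Reduction state is an ordinary Scala Set held on the JVM heap;
    // it is serialized (here via kryo) only at partial-aggregate
    // boundaries, not for every row that reduce() consumes.
    object DistinctCount extends Aggregator[Long, Set[Long], Long] {
      def zero: Set[Long] = Set.empty[Long]
      def reduce(b: Set[Long], a: Long): Set[Long] = b + a
      def merge(b1: Set[Long], b2: Set[Long]): Set[Long] = b1 union b2
      def finish(b: Set[Long]): Long = b.size.toLong
      def bufferEncoder: Encoder[Set[Long]] = Encoders.kryo[Set[Long]]
      def outputEncoder: Encoder[Long] = Encoders.scalaLong
    }

    // Usage on a typed Dataset:
    //   val ds: Dataset[Long] = ...
    //   ds.select(DistinctCount.toColumn).show()

The gap the JIRA describes is that nothing equivalent is exposed for untyped DataFrame UDAFs, which is what exposing TypedImperativeAggregate (or something like it) would address.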