At a high level, some candidate strategies are:

1. "Fix" the logic in ScalaUDAF (possibly in conjunction with modifications to the UDAF trait itself) so that the update method can do the right thing; the per-row cost it currently pays is illustrated in the sketch just below this list.
2. Expose TypedImperativeAggregate to users for defining their own aggregators, since it already does the right thing.
3. As a workaround, allow users to define their own sub-classes of DataType. That would essentially allow one to define the sqlType of the UDT to be the aggregating object itself and make ser/de a no-op. I tried this and it compiles, but Spark's internals only recognize a predefined universe of DataType classes.
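To make the cost concrete, here is a minimal sketch of the pattern in question. The Sketch, SketchUDT, and SketchUDAF names are hypothetical stand-ins, not taken from the JIRA, and since UserDefinedType is not a public API in current Spark the example assumes the file is compiled under an org.apache.spark sub-package to reach the internal class. The point is the update() method: under the current ScalaUDAF plumbing, reading the UDT field out of the buffer row deserializes it and writing it back serializes it again, once per input row.

    // Assumes compilation inside an org.apache.spark sub-package, because
    // UserDefinedType is not exposed publicly in Spark 2.x.
    package org.apache.spark.example.sketch

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Hypothetical aggregating object -- a stand-in for any sketch-like structure.
    class Sketch extends Serializable {
      def add(x: Double): Sketch = this        // placeholder update logic
      def merge(that: Sketch): Sketch = this   // placeholder merge logic
    }

    // Hypothetical UDT backing Sketch with a binary sqlType.
    class SketchUDT extends UserDefinedType[Sketch] {
      def sqlType: DataType = BinaryType
      def serialize(obj: Sketch): Any = Array.empty[Byte]  // real impl: encode the sketch
      def deserialize(datum: Any): Sketch = new Sketch     // real impl: decode the sketch
      def userClass: Class[Sketch] = classOf[Sketch]
    }

    class SketchUDAF extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("x", DoubleType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sketch", new SketchUDT) :: Nil)
      def dataType: DataType = new SketchUDT
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = new Sketch
      }

      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        // Reading the UDT field converts it back to a Sketch (deserialize) ...
        val sk = buffer.getAs[Sketch](0)
        // ... and writing it back converts it to its sqlType again (serialize),
        // so both conversions happen for every input row.
        buffer(0) = sk.add(input.getDouble(0))
      }

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        // The same round trip happens here, but per partial buffer rather than per row.
        buffer1(0) = buffer1.getAs[Sketch](0).merge(buffer2.getAs[Sketch](0))
      }

      def evaluate(buffer: Row): Any = buffer.getAs[Sketch](0)
    }

The TypedImperativeAggregate path (option 2) avoids this by keeping the buffer as a plain JVM object and only invoking serialize/deserialize at shuffle boundaries, which is why exposing it looks attractive.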
All of these options are likely to have implications for the Catalyst internals; I'm not sure whether they are minor or more substantial.

On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin <r...@databricks.com> wrote:

> Yes this is known and an issue for performance. Do you have any thoughts
> on how to fix this?
>
> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson <eerla...@redhat.com>
> wrote:
>
>> I describe some of the details here:
>> https://issues.apache.org/jira/browse/SPARK-27296
>>
>> The short version of the story is that aggregating data structures (UDTs)
>> used by UDAFs are serialized to a Row object, and de-serialized, for every
>> row in a data frame.
>> Cheers,
>> Erik
>>