They are unfortunately all pretty substantial (which is why this problem exists) ...
On Wed, Mar 27, 2019 at 4:36 PM, Erik Erlandson < eerla...@redhat.com > wrote:

> At a high level, some candidate strategies are:
>
> 1. "Fix" the logic in ScalaUDAF (possibly in conjunction with mods to the UDAF
> trait itself) so that the update method can do the right thing.
>
> 2. Expose TypedImperativeAggregate to users for defining their own, since
> it already does the right thing.
>
> 3. As a workaround, allow users to define their own sub-classes of
> DataType. It would essentially allow one to define the sqlType of the UDT
> to be the aggregating object itself and make ser/de a no-op. I tried
> doing this and it will compile, but Spark's internals only consider a
> predefined universe of DataType classes.
>
> All of these options are likely to have implications for the Catalyst
> systems. I'm not sure whether they are minor or more substantial.
>
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@databricks.com > wrote:
>
>> Yes, this is known and an issue for performance. Do you have any thoughts
>> on how to fix this?
>>
>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson < eerlands@redhat.com > wrote:
>>
>>> I describe some of the details here:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-27296
>>>
>>> The short version of the story is that aggregating data structures (UDTs)
>>> used by UDAFs are serialized to a Row object, and de-serialized, for every
>>> row in a data frame.
>>>
>>> Cheers,
>>> Erik
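
As a concrete point of reference for option 2 above: the public Aggregator API already has the shape being asked for, in that reduce and merge operate on a plain Scala buffer object and buffer serialization is expressed once, declaratively, through an Encoder rather than through a Row-backed MutableAggregationBuffer. Below is a minimal sketch of a typed average; the names AvgBuffer and TypedAverage are illustrative only, and this is not a claim that the current Aggregator execution path avoids the per-row conversion described in SPARK-27296.

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    // Illustrative buffer: a plain Scala object, not a Row-backed buffer.
    case class AvgBuffer(var sum: Double, var count: Long)

    object TypedAverage extends Aggregator[Double, AvgBuffer, Double] {
      // Initial buffer value.
      def zero: AvgBuffer = AvgBuffer(0.0, 0L)

      // Called once per input value; works directly on the typed buffer.
      def reduce(b: AvgBuffer, x: Double): AvgBuffer = {
        b.sum += x
        b.count += 1
        b
      }

      // Combines two partial buffers (e.g. from different partitions).
      def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
        b1.sum += b2.sum
        b1.count += b2.count
        b1
      }

      // Produces the final result from the buffer.
      def finish(b: AvgBuffer): Double =
        if (b.count == 0L) 0.0 else b.sum / b.count

      // Buffer/output serialization is described via Encoders, not hand-written
      // conversions inside update().
      def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

It can be applied to a Dataset[Double] with ds.select(TypedAverage.toColumn). The open question in this thread is how to get the same kind of typed-buffer contract for untyped DataFrame aggregation, which is what the candidate strategies above are about.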