Re: UDAFs have an inefficiency problem

2019-07-05 Thread Erik Erlandson
I submitted a PR for this:
https://github.com/apache/spark/pull/25024

On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote:

> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating data structures (UDTs)
> used by UDAFs are serialized to a Row object, and de-serialized, for every
> row in a data frame.
> Cheers,
> Erik
>
>
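
To make the cost described in the quoted message concrete, here is a toy model in plain Python. It is not Spark code: the dict-based buffer, the JSON encoding (standing in for conversion to and from a Row object), and all function names are assumptions made for illustration only. It contrasts round-tripping the aggregation buffer on every input row with updating it in memory and serializing once at the end.

```python
import json


def serialize(buf):
    # Stand-in for converting the aggregation buffer to a Row (assumption).
    return json.dumps(buf)


def deserialize(data):
    # Stand-in for rebuilding the aggregation buffer from a Row (assumption).
    return json.loads(data)


def aggregate_with_roundtrip(values):
    """Deserialize and re-serialize the buffer for every input row."""
    stored = serialize({"total": 0.0, "count": 0})
    roundtrips = 0
    for v in values:
        buf = deserialize(stored)   # one deserialization per row
        buf["total"] += v
        buf["count"] += 1
        stored = serialize(buf)     # one serialization per row
        roundtrips += 1
    final = deserialize(stored)
    return final["total"], final["count"], roundtrips


def aggregate_in_place(values):
    """Keep the buffer in memory and serialize only once at the end."""
    buf = {"total": 0.0, "count": 0}
    for v in values:
        buf["total"] += v
        buf["count"] += 1
    final = deserialize(serialize(buf))  # single round trip
    return final["total"], final["count"], 1
```

Both functions return the same sum and count; only the number of serialization round trips differs, growing with the row count in the first case and staying constant in the second. How the linked PR actually addresses this is described in the PR itself, not here.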


Revisiting Python / pandas UDF

2019-07-05 Thread Reynold Xin
Hi all,

Over the past two years, pandas UDFs have been perhaps the most important 
change to Spark for Python data science. However, this functionality has 
evolved organically, leading to some inconsistencies and confusion among users. 
I created a ticket and a document summarizing the issues, along with a concrete 
proposal to fix them (the changes are pretty small). Thanks to Xiangrui for 
initially bringing this to my attention, and to Li Jin and Hyukjin for offline 
discussions.

Please take a look: 

https://issues.apache.org/jira/browse/SPARK-28264

https://docs.google.com/document/u/1/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit