Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/20295
Hi all,
I did some digging and I think adding a serialization format that serializes a
key object along with an Arrow record batch is quite complicated, because we are
using ArrowStreamReader/Writer for sending batches, and sending extra key data
would require using a lower-level Arrow API for sending/receiving batches.
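For context, here is a minimal sketch of the stream-format round trip using pyarrow (this is only an illustration, not the code in this patch): the stream carries a schema followed by record batches, so attaching a separate key object per batch would mean dropping below this API.
```python
import pyarrow as pa

# Minimal sketch (illustration only): the Arrow stream format carries a schema
# followed by record batches, with no natural slot for a per-batch key object.
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["id"])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)   # only schema + batch data go over the stream
writer.close()

reader = pa.RecordBatchStreamReader(pa.BufferReader(sink.getvalue()))
table = reader.read_all()   # the reader hands back batches, nothing else
```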
I did two things to convince myself that the current approach is fine:
* I added logic to de-duplicate grouping keys when they are already in the data
columns, i.e., if a user calls
```
df.groupby('id').apply(foo_udf)
```
we will not send extra grouping columns because those are already part of the
data columns. Instead, we will just use the corresponding data column to get the
grouping key to pass to the user function. However, if the user calls:
```
df.groupby(df.id % 2).apply(foo_udf)
```
then an extra column `df.id % 2` will be created and sent to the Python worker,
but I think this is an uncommon case (see the de-duplication sketch after this list).
* I did a micro benchmark to see the impact of sending an extra grouping column.
I used a Spark DataFrame with a single column to maximize the effect of the extra
grouping column (in this benchmark, sending the extra grouping column doubles the
data sent to Python; in real use cases the effect of sending extra grouping
columns should be far less).
Even with this benchmark setup, I have not observed a performance difference
when sending extra grouping columns; I think this is because the total time is
dominated by other parts of the computation. [micro
benchmark](https://gist.github.com/icexelloss/88f6c6fdaf04aac39d68d74cd0942c07)
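To illustrate the de-duplication point in the first bullet, here is a rough sketch of the decision being made (a hypothetical helper, not the code in this PR): a grouping expression that is just an existing data column is reused, while a derived expression like `df.id % 2` has to be shipped as an extra column.
```python
def split_grouping_columns(grouping_exprs, data_columns):
    """Hypothetical helper: decide which grouping expressions can reuse an
    existing data column and which must be sent as extra columns."""
    reused, extra = [], []
    for expr in grouping_exprs:
        if expr in data_columns:
            # A plain column reference like 'id' is already in the data columns.
            reused.append(expr)
        else:
            # A derived expression like '(id % 2)' has no matching data column,
            # so an extra column would be generated and sent to the worker.
            extra.append(expr)
    return reused, extra

# df.groupby('id').apply(foo_udf)      -> (['id'], [])        no extra data sent
# df.groupby(df.id % 2).apply(foo_udf) -> ([], ['(id % 2)'])  one extra column sent
print(split_grouping_columns(['id'], ['id', 'v']))
print(split_grouping_columns(['(id % 2)'], ['id', 'v']))
```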
Based on the two points above, I'd like to leave more flexible Arrow
serialization as future work, since it doesn't seem to affect the performance of
this patch, and proceed with the current approach. What do you guys think?