Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/20295
Hi all,
I did some digging and I think adding a serialization format that serializes a
key object along with an Arrow record batch is quite complicated, because we are
using ArrowStreamReader/Writer for sending batches, and sending extra key data
would require using a lower-level Arrow API for sending/receiving batches.
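For context, here is a minimal sketch of the stream-format round trip using pyarrow (this is only an illustration, not the code in this patch): the stream carries a schema followed by record batches, so attaching a separate key object per batch would mean dropping below this API.
```python
import pyarrow as pa

# Minimal sketch (illustration only): the Arrow stream format carries a schema
# followed by record batches, with no natural slot for a per-batch key object.
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["id"])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)   # only schema + batch data go over the stream
writer.close()

reader = pa.RecordBatchStreamReader(pa.BufferReader(sink.getvalue()))
table = reader.read_all()   # the reader hands back batches, nothing else
```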
I did two things to convince myself that the current approach is fine:
* I added logic to de-duplicate grouping keys when they are already in the data
columns, i.e., if a user calls
```
df.groupby('id').apply(foo_udf)
```
we will not send extra grouping columns because those are already part of the
data columns. Instead, we will just use the corresponding data column to get the
grouping key to pass to the user function. However, if the user calls:
```
df.groupby(df.id % 2).apply(foo_udf)
```
then an extra column `df.id % 2` will be created and sent to the Python worker,
but I think this is an uncommon case (see the de-duplication sketch after this list).
* I did a micro benchmark to see the impact of sending an extra grouping column.
I used a Spark DataFrame with a single column to maximize the effect of the extra
grouping column (in this benchmark, sending the extra grouping column doubles the
data sent to Python; in real use cases the effect of sending extra grouping
columns should be far less).
Even with this benchmark setup, I have not observed a performance difference
when sending extra grouping columns; I think this is because the total time is
dominated by other parts of the computation. [micro
benchmark](https://gist.github.com/icexelloss/88f6c6fdaf04aac39d68d74cd0942c07)
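To illustrate the de-duplication point in the first bullet, here is a rough sketch of the decision being made (a hypothetical helper, not the code in this PR): a grouping expression that is just an existing data column is reused, while a derived expression like `df.id % 2` has to be shipped as an extra column.
```python
def split_grouping_columns(grouping_exprs, data_columns):
    """Hypothetical helper: decide which grouping expressions can reuse an
    existing data column and which must be sent as extra columns."""
    reused, extra = [], []
    for expr in grouping_exprs:
        if expr in data_columns:
            # A plain column reference like 'id' is already in the data columns.
            reused.append(expr)
        else:
            # A derived expression like '(id % 2)' has no matching data column,
            # so an extra column would be generated and sent to the worker.
            extra.append(expr)
    return reused, extra

# df.groupby('id').apply(foo_udf)      -> (['id'], [])        no extra data sent
# df.groupby(df.id % 2).apply(foo_udf) -> ([], ['(id % 2)'])  one extra column sent
print(split_grouping_columns(['id'], ['id', 'v']))
print(split_grouping_columns(['(id % 2)'], ['id', 'v']))
```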
Based on the two points above, I'd like to leave more flexible Arrow
serialization as future work, since it doesn't seem to affect the performance of
this patch, and proceed with the current approach. What do you guys think?