Hi all,

I am working on a distributed sorting program which runs on multiple 
computation nodes.

In this sorting program, data is represented as pandas DataFrames and key 
operations are groupby, concat, and sort_values. For shuffling data among the 
computation nodes, the DataFrames are converted to Arrow Record Batches and 
communicated via Arrow Flight.

What I’ve noticed is that much time was spent on the conversion between 
DataFrame and Record Batch.

The [zero-copy 
feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy)
 unfortunately cannot be applied to my case, since the DataFrames contain 
strings as well.

I wanted to try replacing DataFrames with Record Batches, so there would be no 
need of conversion. However, there seems to be no direct way to do groupby and 
sort_values on Record Batches, according to [the 
documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html)

Is there a plan to add such methods to the API of Record Batch in the future?

Kind Regards

Chengxin

Sent with [ProtonMail](https://protonmail.com) Secure Email.

Reply via email to