Hi all, I am working on a distributed sorting program which runs on multiple computation nodes.
In this sorting program, data is represented as pandas DataFrames and key operations are groupby, concat, and sort_values. For shuffling data among the computation nodes, the DataFrames are converted to Arrow Record Batches and communicated via Arrow Flight. What I’ve noticed is that much time was spent on the conversion between DataFrame and Record Batch. The [zero-copy feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy) unfortunately cannot be applied to my case, since the DataFrames contain strings as well. I wanted to try replacing DataFrames with Record Batches, so there would be no need of conversion. However, there seems to be no direct way to do groupby and sort_values on Record Batches, according to [the documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html) Is there a plan to add such methods to the API of Record Batch in the future? Kind Regards Chengxin Sent with [ProtonMail](https://protonmail.com) Secure Email.