Hi Wes, Thank you for your answer. The projects you mentioned look very exciting. I will keep an eye on them.
Kind Regards Chengxin Sent with ProtonMail Secure Email. ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Thursday, April 2, 2020 5:46 PM, Wes McKinney <wesmck...@gmail.com> wrote: > hi Chengxin, > > Yes, if you look at the JIRA tracker and look for past discussions on > the mailing list, there are plans to develop comprehensive data > manipulation and query processing capabilities in this project for use > in Python, R, and any other language that binds to C++, including > C/GLib and Ruby. > > The way that this functionality is exposed in the pyarrow API will > almost certainly be different than pandas, though. Rather than have > objects with long lists of instance methods, we would opt instead for > computational functions that "act" on the data structures, producing > one or more data structures as output, more similar to tools like > dplyr (an R library). Developers are welcome to create pandas-like > convenience layers, of course, should they so choose. > > References: > > - C++ datasets API project > > https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing > > - C++ query engine project > > https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing > > - C++ data frame API project > > https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing > > Building these things take time, especially considering the scope of > maintenance involved with keeping this project running. If anyone > reading is interested in contributing time or money to this effort I'd > be happy to speak with you offline about it. If you would like to > contribute we would be glad to have you aboard. > > Thanks > Wes > > On Thu, Apr 2, 2020 at 6:50 AM Chengxin Ma c...@protonmail.ch.invalid > wrote: > > > > Hi all, > > I am working on a distributed sorting program which runs on multiple > > computation nodes. > > In this sorting program, data is represented as pandas DataFrames and key > > operations are groupby, concat, and sort_values. For shuffling data among > > the computation nodes, the DataFrames are converted to Arrow Record Batches > > and communicated via Arrow Flight. > > What I’ve noticed is that much time was spent on the conversion between > > DataFrame and Record Batch. > > The zero-copy feature unfortunately cannot be applied to my case, since the > > DataFrames contain strings as well. > > I wanted to try replacing DataFrames with Record Batches, so there would be > > no need of conversion. However, there seems to be no direct way to do > > groupby and sort_values on Record Batches, according to the documentation > > Is there a plan to add such methods to the API of Record Batch in the > > future? > > Kind Regards > > Chengxin > > Sent with ProtonMail Secure Email.