Re: Support of more manipulation for Record Batch

Chengxin Ma Fri, 03 Apr 2020 06:54:16 -0700

Hi Wes,

Thank you for your answer.
The projects you mentioned look very exciting. I will keep an eye on them.


Kind Regards
Chengxin


Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, April 2, 2020 5:46 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Chengxin,
>
> Yes, if you look at the JIRA tracker and look for past discussions on
> the mailing list, there are plans to develop comprehensive data
> manipulation and query processing capabilities in this project for use
> in Python, R, and any other language that binds to C++, including
> C/GLib and Ruby.
>
> The way that this functionality is exposed in the pyarrow API will
> almost certainly be different than pandas, though. Rather than have
> objects with long lists of instance methods, we would opt instead for
> computational functions that "act" on the data structures, producing
> one or more data structures as output, more similar to tools like
> dplyr (an R library). Developers are welcome to create pandas-like
> convenience layers, of course, should they so choose.
>
> References:
>
> - C++ datasets API project:
>   https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
>
> - C++ query engine project:
>   https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing
>
> - C++ data frame API project:
>   https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
>
> Building these things takes time, especially considering the scope of
> maintenance involved with keeping this project running. If anyone
> reading is interested in contributing time or money to this effort, I'd
> be happy to speak with you offline about it. If you would like to
> contribute, we would be glad to have you aboard.
>
> Thanks
> Wes
>
> On Thu, Apr 2, 2020 at 6:50 AM Chengxin Ma <c...@protonmail.ch.invalid> wrote:
>
>
> > Hi all,
> >
> > I am working on a distributed sorting program that runs on multiple
> > computation nodes. In this program, data is represented as pandas
> > DataFrames, and the key operations are groupby, concat, and sort_values.
> > To shuffle data among the computation nodes, the DataFrames are converted
> > to Arrow Record Batches and communicated via Arrow Flight.
> >
> > What I've noticed is that much time was spent on the conversion between
> > DataFrame and Record Batch. Unfortunately, the zero-copy feature cannot be
> > applied in my case, since the DataFrames contain strings as well.
> >
> > I wanted to try replacing DataFrames with Record Batches, so that there
> > would be no need for conversion. However, according to the documentation,
> > there seems to be no direct way to do groupby and sort_values on Record
> > Batches.
> >
> > Is there a plan to add such methods to the API of Record Batch in the
> > future?
> >
> > Kind Regards
> > Chengxin
> > Sent with ProtonMail Secure Email.
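
Illustration of the API style discussed in this thread: Wes describes computational functions that "act" on data structures and produce new data structures as output, rather than long lists of instance methods. The sketch below shows that style in plain Python for the three operations named in the original question (concat, sort_values, groupby). It is only a rough sketch under stated assumptions: the Batch type (a dict of column name to list) and the functions take, sort_by, concat and group_sum are hypothetical names invented for this example, and none of them are pyarrow or pandas API.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# Hypothetical stand-in for a record batch: column name -> list of values.
# This is NOT the Arrow RecordBatch type, only an illustration.
Batch = Dict[str, List]


def take(batch: Batch, indices: Sequence[int]) -> Batch:
    """Return a new batch containing the rows at the given positions."""
    return {name: [col[i] for i in indices] for name, col in batch.items()}


def sort_by(batch: Batch, key: str, descending: bool = False) -> Batch:
    """Return a new batch with rows ordered by one column (stable sort)."""
    order = sorted(range(len(batch[key])),
                   key=lambda i: batch[key][i],
                   reverse=descending)
    return take(batch, order)


def concat(batches: Sequence[Batch]) -> Batch:
    """Return a new batch with the rows of all inputs, in input order."""
    return {name: [v for b in batches for v in b[name]]
            for name in batches[0]}


def group_sum(batch: Batch, key: str, value: str) -> Batch:
    """Return one row per distinct key (first-seen order) with summed values."""
    totals = defaultdict(int)
    for k, v in zip(batch[key], batch[value]):
        totals[k] += v
    keys = list(totals)
    return {key: keys, f"{value}_sum": [totals[k] for k in keys]}


# Usage: each function takes batches in and hands a new batch back,
# so the steps compose without mutating their inputs.
left = {"word": ["pear", "apple"], "count": [3, 1]}
right = {"word": ["apple", "fig"], "count": [4, 2]}

merged = concat([left, right])
result = group_sum(sort_by(merged, "word"), "word", "count")
print(result)
```

Because every function returns a fresh structure, a convenience layer with pandas-like methods could be built on top of such functions, which is the option Wes leaves open to developers.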
