RE: Arrow Datasets Functionality for Python

Matthew Turner Mon, 17 Feb 2020 19:07:14 -0800

Hi Francois,

Thanks for the response - the explanation definitely helped and I will review 
the provided documents.


Hi Wes,

I am interested in helping but I have two constraints:

        - With my current schedule I won't have free time for another 2-3 months
        - My skillset is more on the end user / business side.  My main job is 
on a trading desk and I am driving our efforts to build out more analytic 
capabilities for the desk (relying heavily on parquet/pyarrow/pandas).  To 
the extent you think I could still add value I'm happy to discuss further.

Either way, thanks all for the work and I look forward to all the developments 
this year.

Best,

Matthew M. Turner
Email: [email protected]
Phone: (908)-868-2786

-----Original Message-----
From: Wes McKinney <[email protected]> 
Sent: Monday, February 10, 2020 10:33 AM
To: dev <[email protected]>
Subject: Re: Arrow Datasets Functionality for Python

I will add that I'm interested in being involved with developing high level 
Python interfaces to all of this functionality (e.g. using Ibis [1]). It would 
be worth prototyping at least a datasets interface layer for efficient data 
selection (predicate pushdown + filtering) and then expanding to support more 
analytic operations as they are implemented and available in pyarrow. There's 
just a lot of work to do and at the moment not a lot of people to do it. 
Hopefully more organizations will sponsor part- or full-time developers to get 
involved in Apache Arrow development and help with maintenance and feature 
development -- this is a challenging project to contribute to on 
nights/weekends.

[1]: https://github.com/ibis-project/ibis

On Mon, Feb 10, 2020 at 8:34 AM Francois Saint-Jacques 
<[email protected]> wrote:
>
> Hello Matthew,
>
> The dplyr binding is just syntactic sugar on top of the dataset API.
> There are no analytic capabilities yet [1], other than the select and
> the limited projection supported by the dataset API. It looks like it
> is doing analytics because of properly placed `collect()` calls, which
> convert Arrow's stream of RecordBatch into R's internal data frames.
> The analytic work is done by R. The same functionality exists in
> Python: you invoke the dataset scan and then pass the result to
> pandas.
>
> In 2020 [2], we are actively working toward an analytic engine, with
> bindings for R *and* Python. Within this engine, we have physical
> operators, or compute kernels, that can be seen as functions that
> take a stream of RecordBatch and yield a new stream of RecordBatch.
> The dataset API is the Scan physical operator, i.e. it materializes a
> stream of RecordBatch from files or other sources. Gandiva is a
> compiler that generates the Filter and Project physical operators.
> Think of Gandiva as a physical operator factory: you give it a
> predicate (or multiple expressions in the case of projection) and it
> gives you back a function pointer that knows how to evaluate this
> predicate (or these expressions) on a RecordBatch and yield a
> RecordBatch. There still needs to be a coordinator on top of both
> that "plugs" them together, i.e. the execution engine.
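
As a rough illustration of the operator model described above, here is a
pure-Python sketch. It is not Arrow's or Gandiva's API: a "batch" is a plain
dict of column lists standing in for a RecordBatch, and the names scan,
make_filter, make_project and execute are hypothetical. It only shows the
shape of the idea: each operator takes a stream of batches and yields a new
stream, the factories turn a predicate or a set of expressions into such an
operator, and a small coordinator plugs them together.

```python
# Toy model only: a "batch" is a dict mapping column name -> list of values.
# Real Arrow RecordBatches are columnar buffers, and real Gandiva compiles
# expressions to native code; none of that is modelled here.

def rows_of(batch):
    """Turn a columnar batch into a list of row dicts."""
    names = list(batch)
    n = len(batch[names[0]]) if names else 0
    return [{k: batch[k][i] for k in names} for i in range(n)]

def scan(rows, batch_size=2):
    """Scan operator: materialize a stream of batches from a source."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {name: [row[name] for row in chunk] for name in chunk[0]}

def make_filter(predicate):
    """Factory: given a row predicate, return a Filter operator."""
    def filter_op(batches):
        for batch in batches:
            kept = [r for r in rows_of(batch) if predicate(r)]
            yield {k: [r[k] for r in kept] for k in batch}
    return filter_op

def make_project(expressions):
    """Factory: given {output_name: row -> value}, return a Project operator."""
    def project_op(batches):
        for batch in batches:
            rows = rows_of(batch)
            yield {out: [f(r) for r in rows] for out, f in expressions.items()}
    return project_op

def execute(source, operators):
    """Toy coordinator: chain the operators and drain the final stream."""
    stream = source
    for op in operators:
        stream = op(stream)
    return list(stream)

# Made-up sample data, purely for illustration.
trades = [
    {"symbol": "A", "qty": 10, "price": 2.0},
    {"symbol": "B", "qty": 5, "price": 4.0},
    {"symbol": "A", "qty": 1, "price": 3.0},
]
result = execute(
    scan(trades, batch_size=2),
    [
        make_filter(lambda r: r["symbol"] == "A"),
        make_project({"notional": lambda r: r["qty"] * r["price"]}),
    ],
)
```

Because every stage is a generator, batches flow through one at a time, so
nothing is materialized in full until the coordinator collects the output.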
>
> Hope this helps,
> François
>
> [1] https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/r/R/dplyr.R#L255-L322
> [2] https://ursalabs.org/blog/2020-outlook/
>
>
>
> On Sun, Feb 9, 2020 at 11:24 PM Matthew Turner 
> <[email protected]> wrote:
> >
> > Hi Wes / Arrow Dev Team,
> >
> > Following up on our brief twitter 
> > convo<https://twitter.com/wesmckinn/status/1222647039252525057> on the 
> > Datasets functionality in R / Python.
> >
> > To provide context to others, you had mentioned that the API in Python / 
> > pyarrow was more developer-centric and intended for users to consume it 
> > through higher-level interfaces (i.e. Ibis).  This was in comparison to 
> > dplyr, which from your demo had some nice analytic capabilities on top of 
> > Arrow Datasets.
> >
> > Seeing that demonstration made me interested to see similar Arrow Datasets 
> > functionality within Python.  But it doesn't seem that this is an intended 
> > capability for pyarrow, which I do generally understand.  However, I was 
> > trying to understand how Gandiva ties into the Arrow project, as I 
> > understand it to be an analytic engine of sorts (maybe I'm 
> > misunderstanding).  I saw this<http://blog.christianperone.com/tag/python/> 
> > implementation of Gandiva with pandas, which was quite interesting, and was 
> > wondering if this is the strategic goal: to have Gandiva be a component of 
> > third-party tools that use Arrow, or whether Gandiva would eventually be 
> > more of a core analytic component of Arrow.
> >
> > Extending on this, I am hoping to get the team's view on what they see as 
> > the likely home of dplyr-datasets-type functionality within the Python 
> > ecosystem (i.e. Ibis or something else).
> >
> > Thanks to all for your work on the project!
> >
> > Best,
> >
> > Matthew M. Turner
> > Email: [email protected]
> > Phone: (908)-868-2786
> >
