For what it is worth, DataFusion has a DataFrame interface [1] that uses
the same underlying `LogicalPlan` structures as the SQL interface.
Unsurprisingly, it is heavily inspired by pandas.

I believe this interface is more familiar to, and more popular with,
DataFusion users who build plans programmatically (e.g. to implement a
custom query language), even though we also offer a
`LogicalPlanBuilder` [2].
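
To illustrate what "the same underlying plan structures" means, here is a
minimal toy sketch in Rust. It is not DataFusion's actual API: the
`LogicalPlan` enum, the `DataFrame` wrapper, and the `plan_select` helper
below are all hypothetical stand-ins, and predicates are plain strings
rather than real expressions. The only point is that a fluent DataFrame
call chain and a SQL-style front end can both emit the same plan tree.

```rust
// Toy model only: every type and function here is hypothetical and is
// NOT DataFusion's real API.

/// A minimal logical plan tree with three node kinds.
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    Scan { table: String },
    Filter { predicate: String, input: Box<LogicalPlan> },
    Projection { columns: Vec<String>, input: Box<LogicalPlan> },
}

/// A DataFrame is only a handle to a plan; each method wraps the
/// current plan in one new node and returns a new handle.
#[derive(Debug, Clone)]
struct DataFrame {
    plan: LogicalPlan,
}

impl DataFrame {
    /// Start from a table scan.
    fn table(name: &str) -> Self {
        DataFrame { plan: LogicalPlan::Scan { table: name.to_string() } }
    }

    /// Keep only rows matching `predicate` (an opaque string in this toy).
    fn filter(self, predicate: &str) -> Self {
        DataFrame {
            plan: LogicalPlan::Filter {
                predicate: predicate.to_string(),
                input: Box::new(self.plan),
            },
        }
    }

    /// Keep only the named columns.
    fn select(self, columns: &[&str]) -> Self {
        DataFrame {
            plan: LogicalPlan::Projection {
                columns: columns.iter().map(|c| c.to_string()).collect(),
                input: Box::new(self.plan),
            },
        }
    }
}

/// Stand-in for a SQL front end that has already parsed
/// `SELECT <columns> FROM <table> WHERE <predicate>` into its parts.
fn plan_select(columns: &[&str], table: &str, predicate: &str) -> LogicalPlan {
    LogicalPlan::Projection {
        columns: columns.iter().map(|c| c.to_string()).collect(),
        input: Box::new(LogicalPlan::Filter {
            predicate: predicate.to_string(),
            input: Box::new(LogicalPlan::Scan { table: table.to_string() }),
        }),
    }
}
```

In this toy model, `DataFrame::table("t").filter("a > 1").select(&["a", "b"])`
carries a plan equal to `plan_select(&["a", "b"], "t", "a > 1")`, so
anything that consumes the plan does not need to know which front end
produced it.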

So I think there is value in a DataFrame API (one that wraps the C++
engine, for example). But I am not sure DataFrames are at the same level
as the "Arrow Array" interface.

Andrew


[1] https://github.com/apache/arrow-datafusion/blob/6a69f529edb3087aeba57c9f01031a98ad06dd5d/datafusion/core/src/dataframe.rs
[2] https://github.com/apache/arrow-datafusion/blob/6a69f529edb3087aeba57c9f01031a98ad06dd5d/datafusion/core/src/logical_plan/builder.rs#L58-L95

On Thu, May 12, 2022 at 1:14 PM Wes McKinney <wesmck...@gmail.com> wrote:

> > Discussion about whether the community around Arrow would like to have
> > DataFrame-like APIs for Arrow in more languages, for example C++
>
> We've discussed this a bit on the mailing list in the past, see
>
>
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>
> for example. It's a complicated subject because the problems that need
> solving in a "data frame library" are much more than defining an API —
> they involve establishing execution and mutation/copy-on-write
> semantics (the latter of which has been a huge topic of discussion in
> the pandas community, for example). The API would be driving an internal
> data management logic engine (similar to pandas's internal logic
> engine — but hopefully we could make something without as many
> problems) which would manipulate chunks of in-memory and out-of-core
> Arrow data internally.
>
> I still would be interested in an Arrow-native "data frame library"
> similar to the SFrame library that's part of Apple's (now defunct?)
> Turi Create library [1].
>
> It's a can of worms, and a problem not to be approached lightly (thinking
> of that "one does not simply..." meme right now) and best done in heavy
> consultation with communities that have experience supporting
> production use of data frames for data science use cases for many
> years.
>
> [1]: https://github.com/apple/turicreate
>
> On Wed, May 11, 2022 at 11:38 PM Ian Cook <i...@ursacomputing.com> wrote:
> >
> > Attendees:
> >
> > Joris Van den Bossche
> > Ian Cook
> > Nic Crane
> > Raul Cumplido
> > Ian Joiner
> > David Li
> > Rok Mihevc
> > Dragoș Moldovan-Grünfeld
> > Aldrin Montana
> > Weston Pace
> > Eduardo Ponce
> > Matthew Topol
> > Jacob Wujciak
> >
> >
> > Discussion:
> >
> > Eduardo: Draft PR with a guide showing how to create a new Arrow C++
> > compute kernel [1]
> >  - Review requested
> >
> > Weston: Proposed changes to ExecPlan in Arrow C++ compute engine [2]
> >  - Feedback requested on details described in the Jira
> >
> > Rok: Temporal rounding kernels option in Arrow C++ compute engine [3]
> >  - Feedback requested about what we should name it
> >  - Possibilities include ceil_on_boundary, ceil_is_strictly_greater,
> > strict_ceil, is_strict_ceil, ceil_is_strict
> >  - Joris favors ceil_is_strictly_greater
> >
> > Ian C: Discussion about naming the Arrow C++ engine [4]
> >  - Comments welcome on the mailing list
> >
> > David: ADBC (Arrow Database Connectivity) proposal [5][6]
> >  - Feedback requested
> >
> > Ian C: Discussion about whether the community around Arrow would like
> > to have DataFrame-like APIs for Arrow in more languages, for example
> > C++
> >  - For C++, maybe this would look similar to xframe [7]
> >  - Probably better to approach projects like these outside of Arrow
> > and have them produce plans in Substrait format [8] which the Arrow
> > C++ engine (and other engines) could consume and execute
> >
> > Arrow 8.0.0 release
> >  - Most post-release tasks complete
> >  - Please contribute to the release blog post [9]
> >
> > Release process
> >  - Please comment on the proposed RC process change [10]
> >  - There is a discussion about changing to bimonthly major releases
> > (instead of quarterly, which is what we do now)
> >  - To make this work we would need nightly builds to be more stable;
> > Raul and Jacob are working on this
> >
> > Should we publicly share a link that Arrow developers can use to join
> > the Zulip chat?
> >  - Zulip has instructions for how to do this [11]
> >  - We would need a Zulip admin to change the permissions to enable
> > this (Wes, Antoine, Weston, et al. are admins)
> >  - What about the ASF Slack [12]? Should we share the details about
> > that?
> >    - The Slack has a rarely used Arrow channel and a Rust Arrow
> > channel which is more popular
> >    - There were some doubts about whether committer permissions or the
> > associated apache.org email address are required to join, but in fact
> > anyone can join this Slack
> >  - Ian will follow up about this
> >
> > The Data Thread [13]
> >  - Voltron Data is hosting an Arrow-focused virtual conference on June 23
> >  - Registration and speaker applications are open
> >
> > [1] https://github.com/apache/arrow/pull/10296
> > [2] https://issues.apache.org/jira/browse/ARROW-16522
> > [3] https://github.com/apache/arrow/pull/12657/files#diff-6bc7ecec6a4f7bcefc2511cde3bd809340ad0d94bb8f7cc5f4994063c798f2faR124-R132
> > [4] https://lists.apache.org/thread/02sdm4jmqg2z98kr1mg2yq13q858xbx6
> > [5] https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
> > [6] https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/
> > [7] https://xframe.readthedocs.io/en/latest/index.html
> > [8] https://substrait.io
> > [9] https://github.com/apache/arrow-site/pull/207
> > [10] https://lists.apache.org/thread/g6mqpyq2hc11xbgrq2pf653njzy53plt
> > [11] https://zulip.com/help/invite-new-users#create-an-invitation-link
> > [12] https://the-asf.slack.com/
> > [13] https://thedatathread.com
> >
> >
> > On Wed, May 11, 2022 at 9:23 AM Ian Cook <i...@ursacomputing.com> wrote:
> > >
> > > Hi all,
> > >
> > > Our biweekly sync call is today at 12:00 noon Eastern time.
> > >
> > > The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> > > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> > >
> > > Alternatively, enter this information into the Zoom website or app to
> > > join the call:
> > > Meeting ID: 876 4903 3008
> > > Passcode: 958092
> > >
> > > Thanks,
> > > Ian
>
