Also, it seems that DuckDB [1] is heading in the same direction by adding a
dataframe API to its database engine.

[1] https://github.com/duckdb/duckdb/issues/2000

On Thu, May 12, 2022 at 3:36 PM Andrew Lamb <al...@influxdata.com> wrote:

> For what it is worth, DataFusion has a DataFrame interface[1], that uses
> the same underlying `LogicalPlan` structures as the SQL interface.
> Unsurprisingly it is heavily inspired by pandas.
>
> I believe that this interface is more familiar to and more popular with
> DataFusion users who programmatically build plans (e.g. to implement a
> custom query language), even though we offer a `LogicalPlanBuilder` [2] as
> well.
>
> So I think there is value in a DataFrame API (that wraps the C++ engine,
> for example). But I am not sure DataFrames are at the same level as the
> "Arrow Array" interface
>
> Andrew
>
>
> [1]
> https://github.com/apache/arrow-datafusion/blob/6a69f529edb3087aeba57c9f01031a98ad06dd5d/datafusion/core/src/dataframe.rs
> [2]
> https://github.com/apache/arrow-datafusion/blob/6a69f529edb3087aeba57c9f01031a98ad06dd5d/datafusion/core/src/logical_plan/builder.rs#L58-L95
>
> On Thu, May 12, 2022 at 1:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> > Discussion about whether the community around Arrow would like to have
>> > DataFrame-like APIs for Arrow in more languages, for example C++
>>
>> We've discussed this a bit on the mailing list in the past, see
>>
>>
>> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>>
>> for example. It's a complicated subject because the problems that need
>> solving in a "data frame library" are much more than defining an API —
>> they involve establishing execution and mutation/copy-on-write
>> semantics (the latter of which has been a huge topic of discussion in the
>> pandas community, for example). The API would be driving an internal
>> data management logic engine (similar to pandas's internal logic
>> engine — but hopefully we could make something without as many
>> problems) which would manipulate chunks of in-memory and out-of-core
>> Arrow data internally.
>>
>> I still would be interested in an Arrow-native "data frame library"
>> similar to the SFrame library that's part of Apple's (now defunct?)
>> Turi Create library [1]
>>
>> It's a can of worms and a problem not to be approached lightly (thinking of
>> that "one does not simply..." meme right now) and best done in heavy
>> consultation with communities that have experience supporting
>> production use of data frames for data science use cases for many
>> years.
>>
>> [1]: https://github.com/apple/turicreate
>>
>> On Wed, May 11, 2022 at 11:38 PM Ian Cook <i...@ursacomputing.com> wrote:
>> >
>> > Attendees:
>> >
>> > Joris Van den Bossche
>> > Ian Cook
>> > Nic Crane
>> > Raul Cumplido
>> > Ian Joiner
>> > David Li
>> > Rok Mihevc
>> > Dragoș Moldovan-Grünfeld
>> > Aldrin Montana
>> > Weston Pace
>> > Eduardo Ponce
>> > Matthew Topol
>> > Jacob Wujciak
>> >
>> >
>> > Discussion:
>> >
>> > Eduardo: Draft PR with a guide showing how to create a new Arrow C++
>> > compute kernel [1]
>> >  - Review requested
>> >
>> > Weston: Proposed changes to ExecPlan in Arrow C++ compute engine [2]
>> >  - Feedback requested on details described in the Jira
>> >
>> > Rok: Temporal rounding kernels option in Arrow C++ compute engine [3]
>> >  - Feedback requested about what we should name it
>> >  - Possibilities include ceil_on_boundary, ceil_is_strictly_greater,
>> > strict_ceil, is_strict_ceil, ceil_is_strict
>> >  - Joris favors ceil_is_strictly_greater
>> >
>> > Ian C: Discussion about naming the Arrow C++ engine [4]
>> >  - Comments welcome on the mailing list
>> >
>> > David: ADBC (Arrow Database Connectivity) proposal [5][6]
>> >  - Feedback requested
>> >
>> > Ian C: Discussion about whether the community around Arrow would like
>> > to have DataFrame-like APIs for Arrow in more languages, for example
>> > C++
>> >  - For C++, maybe this would look similar to xframe [7]
>> >  - Probably better to approach projects like these outside of Arrow
>> > and have them produce plans in Substrait format [8] which the Arrow
>> > C++ engine (and other engines) could consume and execute
>> >
>> > Arrow 8.0.0 release
>> >  - Most post-release tasks complete
>> >  - Please contribute to the release blog post [9]
>> >
>> > Release process
>> >  - Please comment on the proposed RC process change [10]
>> >  - There is a discussion about changing to bimonthly major releases
>> > (instead of quarterly, which is what we do now)
>> >  - To make this work we would need nightly builds to be more stable;
>> > Raul and Jacob are working on this
>> >
>> > Should we publicly share a link that Arrow developers can use to join
>> > the Zulip chat?
>> >  - Zulip has instructions for how to do this [11]
>> >  - We would need a Zulip admin to change the permissions to enable
>> > this (Wes, Antoine, Weston, et al. are admins)
>> >  - What about the ASF Slack [12]? Should we share the details about
>> > that?
>> >    - The Slack has a rarely used Arrow channel and a Rust Arrow
>> > channel which is more popular
>> >    - There were some doubts about whether committer permissions or the
>> > associated apache.org email address are required to join, but in fact
>> > anyone can join this Slack
>> >  - Ian will follow up about this
>> >
>> > The Data Thread [13]
>> >  - Voltron Data is hosting an Arrow-focused virtual conference on
>> > June 23
>> >  - Registration and speaker applications are open
>> >
>> > [1] https://github.com/apache/arrow/pull/10296
>> > [2] https://issues.apache.org/jira/browse/ARROW-16522
>> > [3] https://github.com/apache/arrow/pull/12657/files#diff-6bc7ecec6a4f7bcefc2511cde3bd809340ad0d94bb8f7cc5f4994063c798f2faR124-R132
>> > [4] https://lists.apache.org/thread/02sdm4jmqg2z98kr1mg2yq13q858xbx6
>> > [5] https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
>> > [6] https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/
>> > [7] https://xframe.readthedocs.io/en/latest/index.html
>> > [8] https://substrait.io
>> > [9] https://github.com/apache/arrow-site/pull/207
>> > [10] https://lists.apache.org/thread/g6mqpyq2hc11xbgrq2pf653njzy53plt
>> > [11] https://zulip.com/help/invite-new-users#create-an-invitation-link
>> > [12] https://the-asf.slack.com/
>> > [13] https://thedatathread.com
>> >
>> >
>> > On Wed, May 11, 2022 at 9:23 AM Ian Cook <i...@ursacomputing.com> wrote:
>> > >
>> > > Hi all,
>> > >
>> > > Our biweekly sync call is today at 12:00 noon Eastern time.
>> > >
>> > > The Zoom meeting URL for this and other biweekly Arrow sync calls is:
>> > > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>> > >
>> > > Alternatively, enter this information into the Zoom website or app to
>> > > join the call:
>> > > Meeting ID: 876 4903 3008
>> > > Passcode: 958092
>> > >
>> > > Thanks,
>> > > Ian
>>
>
