I agree with this as well, and it's also along the lines of what I was trying to propose here:
"[RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity."
https://github.com/apache/arrow/issues/12618

It would be really nice if there were a canonical, language-independent
specification (or something close to it) for what a DataFrame-like API on
top of Arrow should look like.

Then you get continuity between languages and (in theory) it should be
easier to make contributions, since they wouldn't be locked to a particular
language implementation.

On Fri, May 13, 2022 at 10:30 AM Alessandro Molina <alessan...@ursacomputing.com> wrote:

> I think Arrow should definitely consider adding a DataFrame-like API.
>
> There are multiple reasons why exposing Arrow to end users, instead of
> restricting it to developers of frameworks, would be beneficial for the
> Arrow project itself.
>
> A rough approximation of a DataFrame-like API has been growing over the
> years anyway in many bindings, and it's probably better to consolidate
> that effort in a structured process.
>
> The main thing I'm concerned about is adding one more interface for users.
> If we want to grow DataFrame-like APIs, we should grow them on top of
> Dataset (Table probably wouldn't give us enough memory management
> flexibility), as for most users it's already confusing enough to understand
> why they should use Table or Dataset. Imagine if we add one more tabular
> data structure.
>
> On Thu, May 12, 2022 at 7:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > Discussion about whether the community around Arrow would like to have
> > DataFrame-like APIs for Arrow in more languages, for example C++
> >
> > We've discussed this a bit on the mailing list in the past, see
> >
> > https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
> >
> > for example. It's a complicated subject because the problems that need
> > solving in a "data frame library" are much more than defining an API:
> > they involve establishing execution and mutation/copy-on-write
> > semantics (the latter of which has been a huge topic of discussion in
> > the pandas community, for example). The API would be driving an internal
> > data management logic engine (similar to pandas's internal logic
> > engine, but hopefully we could make something without as many
> > problems) which would manipulate chunks of in-memory and out-of-core
> > Arrow data internally.
> >
> > I still would be interested in an Arrow-native "data frame library"
> > similar to the SFrame library that's part of Apple's (now defunct?)
> > Turi Create library [1]
> >
> > It's a can of worms, and a problem not to be approached lightly
> > (thinking of that "one does not simply..." meme right now) and best
> > done in heavy consultation with communities that have experience
> > supporting production use of data frames for data science use cases
> > for many years.
> >
> > [1]: https://github.com/apple/turicreate
> >
> > On Wed, May 11, 2022 at 11:38 PM Ian Cook <i...@ursacomputing.com> wrote:
> > >
> > > Attendees:
> > >
> > > Joris Van den Bossche
> > > Ian Cook
> > > Nic Crane
> > > Raul Cumplido
> > > Ian Joiner
> > > David Li
> > > Rok Mihevc
> > > Dragoș Moldovan-Grünfeld
> > > Aldrin Montana
> > > Weston Pace
> > > Eduardo Ponce
> > > Matthew Topol
> > > Jacob Wujciak
> > >
> > >
> > > Discussion:
> > >
> > > Eduardo: Draft PR with a guide showing how to create a new Arrow C++
> > > compute kernel [1]
> > > - Review requested
> > >
> > > Weston: Proposed changes to ExecPlan in Arrow C++ compute engine [2]
> > > - Feedback requested on details described in the Jira
> > >
> > > Rok: Temporal rounding kernels option in Arrow C++ compute engine [3]
> > > - Feedback requested about what we should name it
> > > - Possibilities include ceil_on_boundary, ceil_is_strictly_greater,
> > > strict_ceil, ceil_is_strictly_greater, is_strict_ceil, ceil_is_strict
> > > - Joris favors ceil_is_strictly_greater
> > >
> > > Ian C: Discussion about naming the Arrow C++ engine [4]
> > > - Comments welcome on the mailing list
> > >
> > > David: ADBC (Arrow Database Connectivity) proposal [5][6]
> > > - Feedback requested
> > >
> > > Ian C: Discussion about whether the community around Arrow would like
> > > to have DataFrame-like APIs for Arrow in more languages, for example
> > > C++
> > > - For C++, maybe this would look similar to xframe [7]
> > > - Probably better to approach projects like these outside of Arrow
> > > and have them produce plans in Substrait format [8] which the Arrow
> > > C++ engine (and other engines) could consume and execute
> > >
> > > Arrow 8.0.0 release
> > > - Most post-release tasks complete
> > > - Please contribute to the release blog post [9]
> > >
> > > Release process
> > > - Please comment on the proposed RC process change [10]
> > > - There is a discussion about changing to bimonthly major releases
> > > (instead of quarterly, which is what we do now)
> > > - To make this work we would need nightly builds to be more stable;
> > > Raul and Jacob are working on this
> > >
> > > Should we publicly share a link that Arrow developers can use to join
> > > the Zulip chat?
> > > - Zulip has instructions for how to do this [11]
> > > - We would need a Zulip admin to change the permissions to enable
> > > this (Wes, Antonie, Weston, et al. are admins)
> > > - What about the ASF Slack [12]? Should we share the details about
> > > that?
> > > - The Slack has a rarely used Arrow channel and a Rust Arrow
> > > channel which is more popular
> > > - There were some doubts about whether committer permissions or the
> > > associated apache.org email address are required to join, but in fact
> > > anyone can join this Slack
> > > - Ian will follow up about this
> > >
> > > The Data Thread [13]
> > > - Voltron Data is hosting an Arrow-focused virtual conference on
> > > June 23
> > > - Registration and speaker applications are open
> > >
> > > [1] https://github.com/apache/arrow/pull/10296
> > > [2] https://issues.apache.org/jira/browse/ARROW-16522
> > > [3] https://github.com/apache/arrow/pull/12657/files#diff-6bc7ecec6a4f7bcefc2511cde3bd809340ad0d94bb8f7cc5f4994063c798f2faR124-R132
> > > [4] https://lists.apache.org/thread/02sdm4jmqg2z98kr1mg2yq13q858xbx6
> > > [5] https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
> > > [6] https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/
> > > [7] https://xframe.readthedocs.io/en/latest/index.html
> > > [8] https://substrait.io
> > > [9] https://github.com/apache/arrow-site/pull/207
> > > [10] https://lists.apache.org/thread/g6mqpyq2hc11xbgrq2pf653njzy53plt
> > > [11] https://zulip.com/help/invite-new-users#create-an-invitation-link
> > > [12] https://the-asf.slack.com/
> > > [13] https://thedatathread.com
> > >
> > >
> > > On Wed, May 11, 2022 at 9:23 AM Ian Cook <i...@ursacomputing.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Our biweekly sync call is today at 12:00 noon Eastern time.
> > > >
> > > > The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> > > > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> > > >
> > > > Alternatively, enter this information into the Zoom website or app to
> > > > join the call:
> > > > Meeting ID: 876 4903 3008
> > > > Passcode: 958092
> > > >
> > > > Thanks,
> > > > Ian