On 13/05/2022 at 16:30, Alessandro Molina wrote:
I think Arrow should definitely consider adding a DataFrame-like API.
There are multiple reasons why exposing Arrow to end users instead of
restricting it to developers of frameworks would be beneficial for the Arrow
project itself.
A rough approximation of a DataFrame-like API has been growing over the
years anyway in many bindings and it's probably better to consolidate that
effort in a structured process.
I'm not sure about this. Different languages have different de facto
standards for dataframe APIs (e.g. Pandas for Python), so it may not be
wise to try to unify them all.
There's also an argument that Arrow C++ should focus on the fundamental
building blocks and let other people build nifty APIs on top of this if they
want to.
Regards
Antoine.
The main thing I'm concerned about is adding one more interface for users.
If we want to grow DataFrame-like APIs we should grow them on top of
Dataset (Table probably wouldn't give us enough memory management
flexibility) as for most users it's already confusing enough to understand
why they should use Table or Dataset. Imagine if we add one more tabular
data structure.
On Thu, May 12, 2022 at 7:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
Discussion about whether the community around Arrow would like to have
DataFrame-like APIs for Arrow in more languages, for example C++
We've discussed this a bit on the mailing list in the past, see
https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
for example. It's a complicated subject because the problems that need
solving in a "data frame library" are much more than defining an API —
they involve establishing execution and mutation/copy-on-write
semantics (the latter of which has been a huge topic of discussion in the
pandas community, for example). The API would be driving an internal
data management logic engine (similar to pandas's internal logic
engine — but hopefully we could make something without as many
problems) which would manipulate chunks of in-memory and out-of-core
Arrow data internally.
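To make the copy-on-write point concrete, here is a minimal sketch in plain Python. It is not how pandas or Arrow manage memory; the class CowFrame and its methods are hypothetical names made up for illustration, and columns are ordinary Python lists standing in for Arrow chunks. It only shows the core idea: copying a frame shares the column buffers, and a buffer is duplicated the first time either side writes to it.

```python
class CowFrame:
    """Hypothetical toy data frame with copy-on-write columns."""

    def __init__(self, columns):
        # Take private copies so the caller's lists are never mutated.
        self._columns = {name: list(values) for name, values in columns.items()}
        # Names of columns whose buffers this frame may mutate in place.
        self._owned = set(self._columns)

    def copy(self):
        # Cheap copy: a new dict that points at the same list objects.
        clone = CowFrame.__new__(CowFrame)
        clone._columns = dict(self._columns)
        # After sharing, neither frame may write in place any more.
        clone._owned = set()
        self._owned = set()
        return clone

    def set_value(self, name, index, value):
        if name not in self._owned:
            # First write to a shared buffer: duplicate it, then take ownership.
            self._columns[name] = list(self._columns[name])
            self._owned.add(name)
        self._columns[name][index] = value

    def column(self, name):
        # Return an immutable snapshot so readers cannot bypass set_value.
        return tuple(self._columns[name])


a = CowFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
b = a.copy()                 # no column data is copied here
b.set_value("x", 0, 99)      # only column "x" of b is duplicated
```

The ownership tracking here is deliberately crude: once a buffer has been shared, the other frame will still copy it on its first write even if nobody else references it any more. A real engine would need reference counts or similar bookkeeping, which is part of why the semantics are hard to get right.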
I still would be interested in an Arrow-native "data frame library"
similar to the SFrame library that's part of Apple's (now defunct?)
Turi Create library [1]
It's a can of worms and a problem not to be approached lightly (thinking of
that "one does not simply..." meme right now) and best done in heavy
consultation with communities that have experience supporting
production use of data frames for data science use cases for many
years.
[1]: https://github.com/apple/turicreate
On Wed, May 11, 2022 at 11:38 PM Ian Cook <i...@ursacomputing.com> wrote:
Attendees:
Joris Van den Bossche
Ian Cook
Nic Crane
Raul Cumplido
Ian Joiner
David Li
Rok Mihevc
Dragoș Moldovan-Grünfeld
Aldrin Montana
Weston Pace
Eduardo Ponce
Matthew Topol
Jacob Wujciak
Discussion:
Eduardo: Draft PR with a guide showing how to create a new Arrow C++
compute kernel [1]
- Review requested
Weston: Proposed changes to ExecPlan in Arrow C++ compute engine [2]
- Feedback requested on details described in the Jira
Rok: Temporal rounding kernels option in Arrow C++ compute engine [3]
- Feedback requested about what we should name it
- Possibilities include ceil_on_boundary, ceil_is_strictly_greater,
strict_ceil, is_strict_ceil, ceil_is_strict
- Joris favors ceil_is_strictly_greater
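For readers unfamiliar with what the option being named controls, the following is a rough sketch in plain Python rather than the Arrow C++ kernel. The function name ceil_to_multiple and the fixed EPOCH origin are assumptions for illustration; only the flag name ceil_is_strictly_greater comes from the discussion above. The flag decides what happens to a value that already sits exactly on a rounding boundary: stay there, or move up to the next boundary.

```python
from datetime import datetime, timedelta

# Assumed origin from which rounding boundaries are counted.
EPOCH = datetime(1970, 1, 1)


def ceil_to_multiple(value, step, ceil_is_strictly_greater=False):
    """Round `value` up to a multiple of `step` counted from EPOCH."""
    remainder = (value - EPOCH) % step  # timedelta % timedelta is a timedelta
    if remainder == timedelta(0):
        # `value` is already on a boundary; the flag picks the behavior.
        return value + step if ceil_is_strictly_greater else value
    # Otherwise go down to the previous boundary, then up by one step.
    return value - remainder + step


midnight = datetime(1970, 1, 1, 0, 0)
three_hours = timedelta(hours=3)

on_boundary = ceil_to_multiple(midnight, three_hours)
strictly_greater = ceil_to_multiple(midnight, three_hours, ceil_is_strictly_greater=True)
```

With the flag off, midnight rounded up to a 3-hour multiple stays at midnight; with the flag on, it becomes 03:00 the same day. Values that are not on a boundary give the same result either way.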
Ian C: Discussion about naming the Arrow C++ engine [4]
- Comments welcome on the mailing list
David: ADBC (Arrow Database Connectivity) proposal [5][6]
- Feedback requested
Ian C: Discussion about whether the community around Arrow would like
to have DataFrame-like APIs for Arrow in more languages, for example
C++
- For C++, maybe this would look similar to xframe [7]
- Probably better to approach projects like these outside of Arrow
and have them produce plans in Substrait format [8] which the Arrow
C++ engine (and other engines) could consume and execute
Arrow 8.0.0 release
- Most post-release tasks complete
- Please contribute to the release blog post [9]
Release process
- Please comment on the proposed RC process change [10]
- There is a discussion about changing to bimonthly major releases
(instead of quarterly which is what we do now)
- To make this work we would need nightly builds to be more stable;
Raul and Jacob are working on this
Should we publicly share a link that Arrow developers can use to join
the Zulip chat?
- Zulip has instructions for how to do this [11]
- We would need a Zulip admin to change the permissions to enable
this (Wes, Antoine, Weston, et al. are admins)
- What about the ASF Slack [12] ? Should we share the details about
that?
- The Slack has a rarely used Arrow channel and a Rust Arrow
channel which is more popular
- There were some doubts about whether committer permissions or the
associated apache.org email address are required to join, but in fact
anyone can join this Slack
- Ian will follow up about this
The Data Thread [13]
- Voltron Data is hosting an Arrow-focused virtual conference on June 23
- Registration and speaker applications are open
[1] https://github.com/apache/arrow/pull/10296
[2] https://issues.apache.org/jira/browse/ARROW-16522
[3]
https://github.com/apache/arrow/pull/12657/files#diff-6bc7ecec6a4f7bcefc2511cde3bd809340ad0d94bb8f7cc5f4994063c798f2faR124-R132
[4] https://lists.apache.org/thread/02sdm4jmqg2z98kr1mg2yq13q858xbx6
[5] https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
[6]
https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/
[7] https://xframe.readthedocs.io/en/latest/index.html
[8] https://substrait.io
[9] https://github.com/apache/arrow-site/pull/207
[10] https://lists.apache.org/thread/g6mqpyq2hc11xbgrq2pf653njzy53plt
[11] https://zulip.com/help/invite-new-users#create-an-invitation-link
[12] https://the-asf.slack.com/
[13] https://thedatathread.com
On Wed, May 11, 2022 at 9:23 AM Ian Cook <i...@ursacomputing.com> wrote:
Hi all,
Our biweekly sync call is today at 12:00 noon Eastern time.
The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092
Thanks,
Ian