Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Alessandro Molina
I brought it up on GitHub, but I'm writing here too to avoid spawning too
many threads.
https://github.com/apache/arrow/issues/38837#issuecomment-2145343755

It's not something we have to address now, but it would be great if we
could design a solution that can be extended in the future to add per-batch
statistics in ArrowArrayStream.

While it's true that in most cases the producer code will be applying the
filtering, in the case of the C data interface we can't take that for
granted. There might be cases where the consumer has no control over the
filtering that the producer applies, and the producer might not be aware
of the filtering that the consumer wants to do.

In those cases, providing the statistics per batch would allow the consumer
to skip the batches it doesn't care about, thus opening the opportunity for
a fast path.
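To illustrate the kind of fast path I have in mind, here is a minimal Python sketch. The shape of the statistics (a per-column dict of min/max travelling alongside each batch) is an invented assumption for illustration, not an existing Arrow API:

```python
# Hypothetical sketch: per-batch statistics enabling a consumer-side fast path.
# The statistics shape (a dict of per-column min/max) is an assumption for
# illustration, not an existing Arrow API.

def filter_batches(batches_with_stats, column, lower_bound):
    """Yield only the batches whose max for `column` can satisfy the predicate."""
    for batch, stats in batches_with_stats:
        col_stats = stats.get(column)
        if col_stats is not None and col_stats["max"] < lower_bound:
            continue  # fast path: the whole batch is guaranteed filtered out
        yield batch  # fall back to row-level filtering on this batch

# Plain lists stand in for record batches:
batches = [
    (["a", "b"], {"x": {"min": 0, "max": 9}}),
    (["c", "d"], {"x": {"min": 10, "max": 19}}),
]
kept = list(filter_batches(batches, "x", lower_bound=10))
# The first batch is skipped without ever scanning its rows.
```

The consumer never touches the data of a pruned batch, which is exactly what the producer cannot do on the consumer's behalf when it doesn't know the predicate.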





On Thu, Jun 6, 2024 at 11:42 AM Antoine Pitrou  wrote:

>
> Hi Kou,
>
> Thanks for pushing for this!
>
> On 06/06/2024 at 11:27, Sutou Kouhei wrote:
> > 4. Standardize Apache Arrow schema for statistics and
> > transmit statistics via separated API call that uses the
> > C data interface
> [...]
> >
> > I think that 4. is the best approach in these candidates.
>
> I agree.
>
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > 
> > Metadata:
> >
> > | Name   | Value | Comments |
> > |------|-------|----------|
> > | ARROW::statistics::version | 1.0.0 | (1)  |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> > Fields:
> >
> > | Name   | Type  | Comments |
> > |------|---------------|----------|
> > | column | utf8  | (2)  |
> > | key| utf8 not null | (3)  |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.
>
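For concreteness, the layout under discussion can be sketched with plain Python structures. This is only a toy stand-in for an actual Arrow array; the `value` field and the example rows are invented for illustration (the tables above only pin down the metadata plus the `column` and `key` fields), and the nesting at the end is just Antoine's question 2 made concrete:

```python
# Toy sketch of the proposed statistics layout (not an Arrow API).
statistics_metadata = {"ARROW:statistics:version": "1.0.0"}

# Fields: column (utf8, nullable), key (utf8, not null), plus an
# illustrative value field.
statistics_rows = [
    {"column": "n_nationkey", "key": "min", "value": 0},
    {"column": "n_nationkey", "key": "max", "value": 24},
    {"column": None, "key": "row_count", "value": 25},  # table-level stat
]

# Question 2 above: the same data nested per column instead, which avoids
# repeating the column name for every statistic.
nested = {}
for row in statistics_rows:
    nested.setdefault(row["column"], {})[row["key"]] = row["value"]
```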


Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-08 Thread Alessandro Molina
On Sun, Apr 7, 2024 at 3:06 PM Andrew Lamb  wrote:

>
> We have had separate releases / votes for Arrow Rust (and Arrow DataFusion)
> and it has served us quite well. The version schemes have diverged
> substantially from the monorepo (we are on version 51.0.0 in arrow-rs, for
> example) and it doesn't seem to have caused any large confusion with users
>
>
I think that versioning will require additional thought for libraries like
PyArrow, Java, etc.
For Rust this is not a problem because there is no link to the C++ library;
PyArrow, instead, is based on what the C++ library provides, so there is a
direct link between the features provided by C++ at a specific version and
the features provided by PyArrow at a specific version.

More or less, PyArrow 20 should have the same bug fixes that C++ 20 has,
and diverging the two versions could easily lead to confusion.
Major versions should probably match between C++ and PyArrow, but I guess
we could have diverging minor and patch versions. Or at least patch
versions, given that a new minor version is usually cut for bug fixes too.


Re: [DISCUSS] Conventions for transporting Arrow data over HTTP

2024-02-02 Thread Alessandro Molina
On Wed, Dec 6, 2023 at 7:45 PM Ian Cook  wrote:

>
> I am interested to hear more perspectives on this. My perspective is
> that we should recommend using HTTP conventions to keep clean
> separation between the Arrow-formatted binary data payloads and the
> various application-specific fields. This can be achieved by encoding
> application-specific fields in URI paths, query parameters, headers,
> or separate parts of multipart/form-data messages.
>

Submitting big binary data in POST messages via multipart/form-data is
usually not very performant: in theory the boundary of the message has to
be constructed by verifying that it does not collide with the content of
the data itself, which for huge files means traversing the whole file in
search of bytes matching the boundary.
Many implementations are optimistic, relying on the fact that a long
enough randomly generated boundary is very unlikely to be contained in the
message, but this is not guaranteed to be true, and I would refrain from
suggesting an approach that, even if the chance is remote, might be slow
or not work at all.
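To make the boundary concern concrete, here is a sketch of what a strictly correct multipart producer has to do: pick a boundary and verify, by scanning the entire payload, that it never occurs in the data, retrying otherwise. The full scan is the cost being discussed for huge uploads:

```python
import secrets

def make_safe_boundary(payload: bytes, max_attempts: int = 10) -> str:
    """Generate a multipart boundary guaranteed not to occur in `payload`.

    The collision check requires scanning the whole payload, which is
    exactly what gets expensive for multi-gigabyte bodies.
    """
    for _ in range(max_attempts):
        boundary = "arrow-" + secrets.token_hex(16)
        if boundary.encode("ascii") not in payload:  # full scan of the body
            return boundary
    raise RuntimeError("could not find a collision-free boundary")

boundary = make_safe_boundary(b"some large Arrow IPC payload...")
```

Optimistic implementations skip the `in payload` check entirely, which is fast but admits the (remote) failure mode described above.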

Also, most HTTP servers implement a maximum request time to reduce the risk
of exhausting the available connections with broken (or malicious) clients
that leave the connection open for too long.
So uploading a 1GB file in a single POST is at serious risk of failing in
most deployments.

There is also the issue that for multipart/form-data a maximum transferred
data size usually exists, as the content of uploaded files is frequently
saved to a temporary file by the HTTP server before it gets forwarded to
the server-side application. This opens the system up to out-of-disk
errors if a client uploads data that is too big and no limit is configured.

So I would suggest that any recommended approach to submit Arrow data via
HTTP relies on Content-Range and chunked uploads to transmit the data,
thus reducing the risk of timeouts or size limits, and allowing a single
chunk to simply be resent when one of those occurs.
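As a sketch of the alternative I'm suggesting (the media type and the exact header usage here are illustrative assumptions, not a proposed standard), the client splits the payload into fixed-size chunks and sends each one with a Content-Range header, so a failed or timed-out chunk can be retried individually:

```python
def iter_upload_chunks(payload: bytes, chunk_size: int = 8 * 1024 * 1024):
    """Yield (headers, body) pairs for a resumable chunked upload.

    Each chunk carries a Content-Range header so the server can reassemble
    the payload and the client can resend just the failed chunk.
    """
    total = len(payload)
    for start in range(0, total, chunk_size):
        body = payload[start:start + chunk_size]
        end = start + len(body) - 1
        headers = {
            "Content-Type": "application/vnd.apache.arrow.stream",
            "Content-Range": f"bytes {start}-{end}/{total}",
        }
        yield headers, body

# Each (headers, body) pair would go out as its own PUT/POST request;
# on timeout only that one chunk is retried.
chunks = list(iter_upload_chunks(b"x" * 20, chunk_size=8))
```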


Re: Proposal: add a bot to close PRs that haven't been updated in 30 days

2023-03-31 Thread Alessandro Molina
I think that marking them as drafts could be a good way to reduce the
overload for people having to review PRs;
drafts can easily be filtered out in GitHub searches.

> I am personally not a huge fan of auto-closing PRs. Especially not
> after a short period like 30 days (I think that's too short for an
> open source project), and we have to be careful with messaging. Very
> often such a PR is "stale" because it is waiting for reviews.

Well, I think the 30 days would be counted since the last update to the PR,
not since it was opened.
My question probably would be: if a PR sat ignored for 30 days without
anyone from the community feeling the need to review and merge it, and
without its primary author feeling the need to push for getting it merged,
isn't that a signal that both parties consider the PR unimportant?

Anyway, 30 days was just an arbitrary value; it could be 60 or anything
else. We have had PRs that sat open without any comment or update for
120+ days.

I like Will's proposal of sending one ping to the author and reviewers, and
if there is no feedback 30 days after the ping we can just close the PR.
I would even make the ping come sooner: 10 days without any update to a PR
is already long enough to signal that the person might have forgotten about
it, and a ping might bring it back to the top of their mind.

On Fri, Mar 31, 2023 at 5:23 PM Aldrin  wrote:

> I have some PRs that have been open for awhile and I changed them to be
> draft PRs (I think that makes them clutter fewer views while I leave them
> open).
>
> I'm just curious if draft PRs are as low cost (low cognitive load) as I
> think they
> are and if instead of closing them the bot can make a PR a draft PR? In
> general
> I agree with the general direction of the discussion otherwise.
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Fri, Mar 31, 2023 at 7:49 AM Will Jones 
> wrote:
>
> > > Also good to know: contributors apparently can't re-open PRs if it was
> > > closed by someone else, so we have to be careful with messages like
> > > "feel free to reopen".
> >
> > Thanks for bringing this up, Joris. That does make closing via bot much
> > less appealing to me.
> >
> > I like your idea of (1) having the bot provide a friendly message asking
> > the contributor whether they plan to continue their work (and maybe
> provide
> > suggestions on how to get reviewer attention if needed) and (2) if there
> is
> > no response to that message after 30 days, we can then close the PR.
> >
> >
> >
> > On Fri, Mar 31, 2023 at 3:57 AM Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> > > I am personally not a huge fan of auto-closing PRs. Especially not
> > > after a short period like 30 days (I think that's too short for an
> > > open source project), and we have to be careful with messaging. Very
> > > often such a PR is "stale" because it is waiting for reviews. I know
> > > we have the labels now that could indicate this, but those are not
> > > (yet) bullet proof (for example, if I quickly answer to one comment it
> > > will already be marked as "awaiting changes", while in fact it might
> > > still be waiting on actual review). I think in general it is difficult
> > > to know the exact reason why something is stale, a good reason to be
> > > careful with automated actions that can be perceived as unfriendly.
> > >
> > > Personally, I think commenting on a PR instead of closing it might be
> > > a good alternative, if we craft a good and helpful message. That can
> > > act as a useful reminder, both towards the author as maintainer, and
> > > can also *ask* to close if they are not planning to further work on it
> > > (and for example, we could still auto-close PRs if nothing happened
> > > (no push, no comment, ..) on such a PR after an additional period of
> > > time).
> > >
> > > Also good to know: contributors apparently can't re-open PRs if it was
> > > closed by someone else, so we have to be careful with messages like
> > > "feel free to reopen".
> > >
> > > On Thu, 30 Mar 2023 at 23:11, Will Jones 
> > wrote:
> > > >
> > > > I'm +0 on the reviewer bot pings. Closing PRs where the author hasn't
> > > > updated in 30 days is something a maintainer would have to do
> anyways,
> > so
> > > > it seems like a useful automation. And there's only one author, so
> it's
> > > > guaranteed to ping the right person. Things are not so clean with
> > > reviewers.
> > > >
> > > > With the labels and codeowners file [1] I think we have supplied
> > > sufficient
> > > > tools so that each subproject in the monorepo can manage their review
> > > > process in their own way. For example, I have a bookmark that takes
> me
> > > to a
> > > > filtered view of PRs that only shows me the C++ Parquet ones that are
> > > ready
> > > > for review [2]. I'd encourage each reviewer to have a similar view of
> > the
> > > > project that they regularly check.
> > > >
> > > > [1] 

Re: Plasma will be removed in Arrow 12.0.0

2023-03-17 Thread Alessandro Molina
How does PyArrow cope with multiprocessing.Manager? I remember there were
some inefficiencies when pickle was used (mostly related to slicing), but
in theory it should work.
That is probably an easy enough replacement for Plasma, and it is standard.
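For reference, the standard-library route being compared here looks roughly like the sketch below (stdlib-only; for real Arrow data one would place the IPC-serialized buffer into the shared block instead of the placeholder bytes):

```python
from multiprocessing import shared_memory

# Producer: copy a serialized payload (e.g. an Arrow IPC stream) into a
# named shared-memory block that other processes can attach to by name.
payload = b"pretend this is an Arrow IPC stream"
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Consumer (normally in another process): attach by name and get a
# zero-copy view of the same physical memory.
reader = shared_memory.SharedMemory(name=shm.name)
received = bytes(reader.buf[:len(payload)])

reader.close()
shm.close()
shm.unlink()  # free the block once every process is done with it
```

Unlike Plasma there is no reference counting or object store semantics here; lifetime management (`unlink`) is entirely on the application.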

On Wed, Mar 15, 2023 at 10:24 PM Will Jones  wrote:

> Hello all,
>
> First, a reminder that Plasma has been deprecated and will be removed in
> the 12.0.0 release of the C++, Python, and Java Arrow libraries. [1]
>
> I know some used Plasma as a convenient way to share Arrow data between
> Python processes, so I pulled together a quick performance comparison
> against two supported alternatives: Flight over unix domain socket and the
> Python sharedmemory module. [2] The shared memory example performs
> comparably to Plasma, but I don't think is accessible from other languages.
> The Flight test is slower than shared memory, but still fairly fast, and of
> course works across languages. I wrote a little more about the shared
> memory case in a stackoverflow answer [3].
>
> If you have migrated off of Plasma and want to share with other users what
> you moved to, please do so in this thread.
>
> Best,
>
> Will Jones
>
> [1] https://github.com/apache/arrow/issues/33243
> [2] https://github.com/wjones127/arrow-ipc-bench
> [3] https://stackoverflow.com/a/75402621/2048858
>


Re: [VOTE] Disable ASF Jira issue reporting

2022-11-25 Thread Alessandro Molina
+1, as long as by "now" we actually mean "as soon as the necessary scripts
have been ported to GitHub".

I mean, I doubt the plan is to disable Jira before we can actually ship PRs
from GitHub issues, and thus block development.



On Wed, 23 Nov 2022 at 22:37, Todd Farmer  wrote:

> Hello,
>
> I would like to propose that issue reporting in ASF Jira for the Apache
> Arrow project be disabled, and all users directed to use GitHub issues for
> reporting going forward. GitHub issue reporting is now enabled [1] in
> response to a recent Infra policy change eliminating self-service user
> registration for ASF Jira accounts. The Apache Arrow project has already
> voted in support of migrating issue tracking from ASF Jira to GitHub issues
> [2], and migration work is ongoing [3].
>
> Disabling ASF Jira issue reporting will move all such work to GitHub
> issues. I expect that usage of this new platform by all participants - not
> just new community members lacking ASF Jira accounts - will expedite
> further discovery and improvements to this platform. Furthermore, this will
> prevent new users from being routed to a new, and potentially "lesser",
> issue reporting experience.
>
> Please note that this proposal does NOT move work on existing ASF Jira
> issues to GitHub - that work should continue in Jira until issues are
> migrated and the Jira system set to read-only. There will be a separate
> discussion when that activity is ready.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Disable issue reporting on ASF Jira for the Apache Arrow project
> [ ] -1 Leave issue reporting enabled on ASF Jira for the Apache Arrow
> project because...
>
> [1] https://github.com/apache/arrow/issues/new/choose
> [2] https://lists.apache.org/thread/l545m95xmf3w47oxwqxvg811or7b93tb
> [3]
>
> https://docs.google.com/document/d/1UaSJs-oyuq8QvlUPoQ9GeiwP19LK5ZzF_5-HLfHDCIg/edit?usp=sharing
>
> Todd Farmer
>


Re: Parser for ExecPlans

2022-11-08 Thread Alessandro Molina
To be honest I find this YAML-based representation a bit confusing due to
the unclear parameters of functions.
In your specific example you have a JOIN taking two sources as its inputs,
but how do I know that both sources are meant to be inputs to the join,
and not only the last one?
For someone knowledgeable about how Acero works it is obvious that a join
takes two inputs, but it's still a bit unclear which ones those two inputs
are.

I agree with the point of using an easily parsable/writable language like
JSON or YAML, as that would more easily allow developers to construct
pipelines at runtime in their own favorite language and then compile them
down. But at that point, aren't we reimplementing Substrait?

Another thing that came to my mind: the pipeline is written in a way that
fits the compiler more than a human.
Humans would probably design their pipeline starting from the data source
and then applying transformations as they think of the next step, while
here you need to think backwards. Obviously you can append to the top as
you write your pipeline, but that's still a bit counterintuitive.

Just my two cents.



On Thu, Nov 3, 2022 at 8:08 PM Weston Pace  wrote:

> Indentation works well when you omit the other arguments (e.g. ...)
> but once you mix in the arguments for the nodes (especially if those
> arguments have their own indentation / structure) then it ends up
> becoming unreadable I think.  I prefer the idea of each node having
> its own block, with no indentation, and using indentation purely for
> argument structure.  For example (using YAML), consider the query
> `SELECT n_nationkey, n_name, r_name FROM nation INNER JOIN region ON
> n_regionkey = r_regionkey`.  Note, we don't have a serialization for
> datasets so I'm using substrait serialization for reads.
>
> ```
> project:
>   expressions:
>     - "!0"
>     - "!1"
>     - "!2"
>   names:
>     - "n_nationkey"
>     - "n_name"
>     - "r_name"
>
> join:
>   left_keys:
>     - "!2"
>   right_keys:
>     - "!4"
>   type: JOIN_TYPE_INNER
>
> read:
>   base_schema:
>     names:
>       - "r_regionkey"
>       - "r_name"
>       - "r_comment"
>     struct:
>       types:
>         - i32?
>         - string?
>         - string?
>   named_table:
>     names:
>       - "region"
>
> read:
>   base_schema:
>     names:
>       - "n_nationkey"
>       - "n_name"
>       - "n_regionkey"
>       - "n_comment"
>     struct:
>       types:
>         - i32?
>         - string?
>         - i32?
>         - string?
>   named_table:
>     names:
>       - "nation"
> ```
>
> I feel the above is pretty reasonable once you get past the learning
> curve of prefix processing to build the tree.
>
> It's not clear that node-level indentation adds much.
>
> ```
> project:
>   expressions:
>     - "!0"
>     - "!1"
>     - "!2"
>   names:
>     - "n_nationkey"
>     - "n_name"
>     - "r_name"
>
>   join:
>     left_keys:
>       - "!2"
>     right_keys:
>       - "!4"
>     type: JOIN_TYPE_INNER
>
>     read:
>       base_schema:
>         names:
>           - "r_regionkey"
>           - "r_name"
>           - "r_comment"
>         struct:
>           types:
>             - i32?
>             - string?
>             - string?
>       named_table:
>         names:
>           - "region"
>
>     read:
>       base_schema:
>         names:
>           - "n_nationkey"
>           - "n_name"
>           - "n_regionkey"
>           - "n_comment"
>         struct:
>           types:
>             - i32?
>             - string?
>             - i32?
>             - string?
>       named_table:
>         names:
>           - "nation"
> ```
>
> And then I think adding parentheses doesn't make sense.  I suppose you
> could change from YAML to something like pythons or JS's formats for
> array and dict literals but I think it would be quite messy.
>
> On Thu, Nov 3, 2022 at 11:07 AM Percy Camilo Triveño Aucahuasi
>  wrote:
> >
> > Thanks Sasha!
> >
> > A nice advantage about parentheses is that most editors can track and
> > highlight the sections between them.
> > Also, those parentheses can be optional when we detect new lines (in the
> > case some users don't want to deal with many parentheses); in that case,
> we
> > would just need to ask indentation.
> >
> > Percy
> >
> >
> > On Thu, Nov 3, 2022 at 12:47 PM Sasha Krassovsky <
> krassovskysa...@gmail.com>
> > wrote:
> >
> > > Hi Percy,
> > > Thanks for the input! New lines would be no problem at all, they’d
> just be
> > > treated the same as any other whitespace. One thing to point out about
> the
> > > function call style when written that way is that it looks a lot like
> the
> > > list style, it’s just that there are more parentheses to keep track of,
> > > though it does make it more obvious what delineates a subtree.
> > >
> > > Sasha
> > >
> > >
> > > > On 3 Nov 2022, at 10:35, Percy Camilo Triveño Aucahuasi <
> > > percy.camilo...@gmail.com> wrote:
> > > >
> > > > Hi Sasha,
> > > >
> > > > I like the 

Re: [DISCUSS] Move issue tracking to

2022-10-25 Thread Alessandro Molina
On Tue, Oct 25, 2022 at 1:55 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

>
> I think the main thing we will miss are the Links (relation between
> issues), but we can try to promote some consistent usage of adding
> "Duplicate of #...", "Related to #..." in top post of an issue when
> appropriate.
>

If we plan to migrate to GitHub, I think we should add an Issue Template (
https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository
) to make sure we don't proliferate too many ways of doing the same thing
when it comes to categorizing issues properly.


Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-08 Thread Alessandro Molina
RLE would probably have some benefits that are worth evaluating. I would
personally go in the direction of building a minimal benchmarking suite
for some of the cases where we expect to see the most benefit (i.e.
filtering), so we can discuss with real numbers.

Also, the currently proposed format divides run lengths and values; maybe
a format where run lengths and values are stored interleaved in the same
buffer would allow more optimisations in the context of vectorised
operations, even though it might be harder to work with for things that
are not fixed width.
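To make the two layouts concrete, here is a pure-Python sketch (illustrative only, not the proposed Arrow memory format):

```python
def rle_encode_split(values):
    """RLE with run values and run lengths in two separate buffers
    (the layout currently proposed)."""
    run_values, run_lengths = [], []
    for v in values:
        if run_values and run_values[-1] == v:
            run_lengths[-1] += 1
        else:
            run_values.append(v)
            run_lengths.append(1)
    return run_values, run_lengths

def rle_encode_interleaved(values):
    """Alternative layout: (length, value) pairs interleaved in one buffer.
    Better locality per run, but awkward for non-fixed-width values."""
    run_values, run_lengths = rle_encode_split(values)
    interleaved = []
    for length, value in zip(run_lengths, run_values):
        interleaved.extend((length, value))
    return interleaved

data = [7, 7, 7, 2, 2, 9]
```

The split layout keeps the values buffer identical in shape to a plain fixed-width array, while the interleaved one puts each run's length next to its value, which is where the vectorisation trade-off would show up.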

On Tue, Jun 7, 2022 at 7:56 PM Tobias Zagorni 
wrote:

> I created a Jira for adding RLE as ARROW-16771, and draft PRs:
>
> - https://github.com/apache/arrow/pull/13330
>   Encode/Decode functions for (currently fixed width types only)
>
> - https://github.com/apache/arrow/pull/1
>   For updating docs
>
> Best,
> Tobias
>
> On Tuesday, 31.05.2022 at 17:13 -0500, Wes McKinney wrote:
> > I haven't had a chance to look at the branch in detail, but if you
> > can
> > provide a pointer to a specification or other details about the
> > proposed memory format for RLE (basically: what would be added to the
> > columnar documentation as well as the Flatbuffers schema files), it
> > would be helpful so it can be circulated to some other interested
> > parties working primarily outside of Arrow (e.g. DuckDB) who might
> > like to converge on a standard especially given that it would be
> > exported across the C data interface. Thanks!
>
>


Re: Arrow sync call May 11 at 12:00 US/Eastern, 16:00 UTC

2022-05-13 Thread Alessandro Molina
I think Arrow should definitely consider adding a DataFrame-like API.

There are multiple reasons why exposing Arrow to end users, instead of
restricting it to developers of frameworks, would be beneficial for the
Arrow project itself.

A rough approximation of a DataFrame-like API has been growing over the
years in many bindings anyway, and it's probably better to consolidate
that effort in a structured process.

The main thing I'm concerned about is adding one more interface for users.
If we want to grow DataFrame-like APIs, we should grow them on top of
Dataset (Table probably wouldn't give us enough memory-management
flexibility), as for most users it's already confusing enough to
understand why they should use Table or Dataset; imagine if we add one
more tabular data structure.

On Thu, May 12, 2022 at 7:14 PM Wes McKinney  wrote:

> > Discussion about whether the community around Arrow would like to have
> DataFrame-like APIs for Arrow in more languages, for example C++
>
> We've discussed this a bit on the mailing list in the past, see
>
>
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>
> for example. It's a complicated subject because the problems that need
> solving in a "data frame library" are much more than defining an API —
> they involve establishing execution and mutation/copy-on-write
> semantics (the latter which has been a huge topic of discussion in the
> pandas community, for example). The API would be driving an internal
> data management logic engine (similar to pandas's internal logic
> engine — but hopefully we could make something without as many
> problems) which would manipulate chunks of in-memory and out-of-core
> Arrow data internally.
>
> I still would be interested in an Arrow-native "data frame library"
> similar to the SFrame library that's part of Apple's (now defunct?)
> Turi Create library [1]
>
> It's a can of worms but a problem not approached lightly (thinking of
> that "one does not simply..." meme right now) and best done in heavy
> consultation with communities that have experience supporting
> production use of data frames for data science use cases for many
> years.
>
> [1]: https://github.com/apple/turicreate
>
> On Wed, May 11, 2022 at 11:38 PM Ian Cook  wrote:
> >
> > Attendees:
> >
> > Joris Van den Bossche
> > Ian Cook
> > Nic Crane
> > Raul Cumplido
> > Ian Joiner
> > David Li
> > Rok Mihevc
> > Dragoș Moldovan-Grünfeld
> > Aldrin Montana
> > Weston Pace
> > Eduardo Ponce
> > Matthew Topol
> > Jacob Wujciak
> >
> >
> > Discussion:
> >
> > Eduardo: Draft PR with a guide showing how to create a new Arrow C++
> > compute kernel [1]
> >  - Review requested
> >
> > Weston: Proposed changes to ExecPlan in Arrow C++ compute engine [2]
> >  - Feedback requested on details described in the Jira
> >
> > Rok: Temporal rounding kernels option in Arrow C++ compute engine [3]
> >  - Feedback requested about what we should name it
> >  - Possibilities include ceil_on_boundary, ceil_is_strictly_greater,
> > strict_ceil, ceil_is_strictly_greater, is_strict_ceil, ceil_is_strict
> >  - Joris favors ceil_is_strictly_greater
> >
> > Ian C: Discussion about naming the Arrow C++ engine [4]
> >  - Comments welcome on the mailing list
> >
> > David: ADBC (Arrow Database Connectivity) proposal [5][6]
> >  - Feedback requested
> >
> > Ian C: Discussion about whether the community around Arrow would like
> > to have DataFrame-like APIs for Arrow in more languages, for example
> > C++
> >  - For C++, maybe this would look similar to xframe [7]
> >  - Probably better to approach projects like these outside of Arrow
> > and have them produce plans in Substrait format [8] which the Arrow
> > C++ engine (and other engines) could consume and execute
> >
> > Arrow 8.0.0 release
> >  - Most post-release tasks complete
> >  - Please contribute to the release blog post [9]
> >
> > Release process
> >  - Please comment on the proposed RC process change [10]
> >  - There is a discussion about changing to bimonthly major releases
> > (instead of quarterly, which is what we do now)
> >  - To make this work we would need nightly builds to be more stable;
> > Raul and Jacob are working on this
> >
> > Should we publicly share a link that Arrow developers can use to join
> > the Zulip chat?
> >  - Zulip has instructions for how to do this  [11]
> >  - We would need a Zulip admin to change the permissions to enable
> > this (Wes, Antoine, Weston, et al. are admins)
> >  - What about the ASF Slack [12] ? Should we share the details about
> that?
> >- The Slack has a rarely used Arrow channel and a Rust Arrow
> > channel which is more popular
> >- There were some doubts about whether committer permissions or the
> > associated apache.org email address are required to join, but in fact
> > anyone can join this Slack
> >  - Ian will follow up about this
> >
> > The Data Thread [13]
> >  - Voltron Data is hosting an Arrow-focused virtual 

Re: [DISCUSS][C++][Python]Switch default mmap behaviour to off

2022-05-11 Thread Alessandro Molina
As far as I understood, the idea is not to fully remove memory mapping,
just turn the current mmap=True default arguments to mmap=False

The goal is mostly to provide consistent behaviour for end users. At the
moment users might face very different performances when they read locally
or on a network filesystem like NFS, because we will try to use memory
mapping on both. But there were users reports that trying to memory map on
NFS lead to terrible performances. By disabling memory mapping by default
we can offer a more consistent experience to users.

By default switching memory mapping off when reading/writing formats
shouldn't influence much local performances, as most formats need to go
through a decode phase and thus won't benefit much from memory mapping. The
only format where mmap can really be effective is the IPC one. And in that
case if users know what they are doing, they can still pass mmap=True.

We would still keep memory mapping enabled for some features. For example
in the future we might implement spillover of datasets, in such case the
spillover would probably rely on memory mapping.
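For background, the mechanism being toggled here is plain file memory mapping. A stdlib-only sketch of the two code paths (the mmap=True/mmap=False keyword discussed above belongs to the PyArrow reader APIs; this only illustrates the underlying difference):

```python
import mmap
import os
import tempfile

# Write a sample file standing in for an Arrow IPC file.
fd, path = tempfile.mkstemp()
os.write(fd, b"arrow-ipc-bytes" * 1000)
os.close(fd)

# mmap=False equivalent: an explicit read() copies the bytes into memory
# up front, with predictable cost on any filesystem.
with open(path, "rb") as f:
    copied = f.read()

# mmap=True equivalent: the OS pages the file in lazily. Great locally,
# but on NFS each page fault can turn into a network round-trip, which is
# the pathological behaviour reported by users.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first_chunk = bytes(mapped[:15])

os.remove(path)
```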

On Fri, May 6, 2022 at 10:09 AM Sasha Krassovsky 
wrote:

> Hi,
> Which use of mmap are you referring to in the code base? Mmap in general
> could have a lot of different uses. The point of the paper you linked is
> that database management systems should explicitly manage their paging to
> and from disk to maintain transactional consistency or to avoid performance
> penalties if the working set doesn’t fit in memory. Arrow doesn’t care
> about the former. As for the latter, something like IPC might make good use
> of mmap. It might not even be writing to a real file on disk but to a
> stream or even to another process’s address space. In that scenario mmap
> definitely does make sense.
>
> That’s not to say this isn’t something worth discussing, but I feel the
> paper’s results are much more nuanced than “we should remove mmap because
> mmap is bad”. It would help to have some specific instances to look at to
> see if it makes sense to switch to something else.
>
> Sasha Krassovsky
>
> > On 5 May 2022, at 23:03, Alvin Chunga Mamani 
> wrote:
> >
> > Hi all,
> > I start this discussion to comment on the change to disable the use of
> mmap
> > by default, which represents a risk in non-local/pseudo file systems that
> > can affect performance.
> > Part of the solution would be to have a flag at the compilation level
> that
> > allows you to activate or deactivate the use of mmap in arrow
> C++/pyarrow.
> > Here in [1] an analysis on the use of mmap in Database Management System
> is
> > presented
> >
> >
> > Thanks.
> >
> > [1] https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf
>


Re: [DISC] (Python) Dropping support for manylinux2010

2022-05-05 Thread Alessandro Molina
non binding +1

On Thu, May 5, 2022 at 1:02 PM Jacob Wujciak  wrote:

> Hi all,
>
> I would like to propose that we drop support for manylinux2010.
>
> CentOS 6, on which the manylinux2010 image is based, has been EOL for over
> two years [1].
> There is now also an official announcement by pypa that
> manylinux2010 support will be dropped sometime in 2022 [2] that has not
> received any feedback on either Github or Discourse. Since the announcement
> the percentage of affected users has dropped from ~7% to ~4% of Python 3.7
> users [3]. ~52% of pyarrow users are on 3.7 so only ~2% of pyarrow users
> [4] would potentially be affected by an issue fixed by updating pip [5].
>
> We have already had several CI issues with the manylinux2010 builds that
> required workarounds. There are now also issues with verification as there
> are no aarch64 wheels for manylinux2010 [6].
>
> [1]:
> https://access.redhat.com/support/policy/updates/errata/#Life_Cycle_Dates
> [2]: https://github.com/pypa/manylinux/issues/1281
> [3]: https://mayeut.github.io/manylinux-timeline/
> [4]: https://pypistats.org/packages/pyarrow
> [5]: https://pip.pypa.io/en/stable/news/#v19-3
> [6]: https://issues.apache.org/jira/browse/ARROW-16476
>
> Thanks,
> Jacob
>


Re: [DISC] (Java) Add Windows binaries to Maven packages

2022-05-04 Thread Alessandro Molina
The proposal seems reasonable to me; we should do our best to provide
users the same experience on the various systems whenever possible.

As long as we don't receive complaints about the package size, I think we
can live with it. If it becomes a problem for our users, we can always
make per-system binaries in the future.

PS: I think you forgot to enable comments on the Google Doc; that's
something you usually want to allow, as it eases providing feedback.

On Tue, May 3, 2022 at 4:19 PM Larry White  wrote:

> Hi all,
>
> Please see
>
> https://docs.google.com/document/d/1y25kRrXlORnUD9p7wTMOWjC6wONEyI9rU-Pv4q1udZ8/edit?usp=sharing
> for a copy of this email with proper formatting.
>
> thanks.
>
> On Mon, May 2, 2022 at 4:23 PM Larry White  wrote:
>
> > Hi all,
> >
> >
> > I would like to request your feedback on incorporating Windows binaries
> in
> > those Maven packages that have native Arrow dependencies, while drawing
> > your attention to the likely impact on jar size.
> >
> >
> > Five of the 23 arrow packages on Maven Central have native dependencies.
> > Four of those five have bundled native libraries included in the maven
> > package jar itself. (The exception is the plasma package.) For the
> others,
> > both .so (Linux shared-object) and .dylib (OSX dynamic library) files are
> > provided in the same jar. Windows native libraries are not included.
> >
> >
> > The packages in question are:
> >
> >    - arrow-dataset
> >    - arrow-orc
> >    - arrow-c
> >    - arrow-gandiva
> >
> >
> > For developers using Arrow on OSX or Linux, the experience using the
> > arrow-dataset jar with its bundled native library is the same as using a
> > pure Java library. Including Windows binaries in the jars would expand
> the
> > community of developers who could use Arrow features like datasets
> > “out of the box.”
> >
> >
> > Moreover, it is not trivial for devs on Windows to create their own
> > solution. To the best of my knowledge, pre-compiled JNI DLLs are not
> > available for download, and there are no build scripts or instructions,
> > as there are for Linux and Mac users (see
> >
> https://arrow.apache.org/docs/dev/developers/java/building.html#building-arrow-jni-modules
> > ).
> > Effort
> >
> > To produce the JNI DLLs, the main effort will be to create new
> > Windows-focused build scripts similar to
> > arrow/ci/scripts/java_jni_macos_build.sh,
> > and incorporate them into the larger build process.
> >
> >
> > Creating these build files is a prerequisite for the suggested packaging
> > changes, but is also desirable in its own right, even if the proposed
> > packaging change is not implemented.
> > File size concern
> >
> > The downside of including Windows binaries is that these files are large.
> > In the 7.0.0 release, the two native library files included in the
> dataset
> > jar total 78 MB on disk, which is roughly 100% of the total size of the
> > jar. See table below for more details.
> >
> > module    .dylib (MB)    .so (MB)    combined (MB)
> > dataset   34.6           43.7        78.3
> > ORC       29.3           37.9        67.2
> > Gandiva   77.4           87.1        164.5
> > c-data    <1.0           <1.0        <1.0
> > Total     141.3          167.7
> >
> > It’s estimated that DLLs would be slightly larger than the dylib files,
> so
> > that the proposed change would increase the size of the dataset jar from
> > 78.3 MB to about 114 MB.
> >
> > For reference, here are the native Arrow libraries (.so) in a PyArrow
> > x86-64 wheel:
> >
> > library          .so size (MB)
> > Dataset          2.3
> > Flight           13.0
> > Python           2.1
> > Python-flight    0.1
> > Plasma           0.2
> > Parquet          4.3
> > Arrow            49.0
> > Total            71.0
> >
> > Note that this isn't an apples-to-apples comparison: the PyArrow
> libraries
> > do not include Gandiva, while the Java libraries do not include Flight,
> > Plasma, Parquet, or (presumably) some amount of the code in the Arrow
> file.
> >
> > As more C++ functionality is used by Java code the number of modules with
> > native dependencies may rise, and the size of the individual libraries
> may
> > increase.
> >
> > For the sake of simplicity, it is preferable to produce a single Jar for
> > each module that contains binaries for the three platforms: Windows, OSX,
> > and Linux. If file size is a significant concern, there are several
> options:
> >
> >
> >
> >- Stripping some symbols (`strip -x`) from the Linux dataset JNI library
> >  brings it down from 43 to 34 MB, at the cost of debug information. It
> >  may be worth considering this option for release builds.
> >- It may be possible to combine modules to 

Re: Arrow sync call March 2 at 12:00 US/Eastern, 17:00 UTC

2022-03-02 Thread Alessandro Molina
Attendees:


Alessandro Molina

Micah Kornfield

David Li

Joris Van Den Bossche



Discussion:


Flight SQL Optimization for Small Results

 - Reference to
https://databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html


 - Building directly in Flight, as Flight will benefit from it too.


Binary vs String KeyValue Metadata

  - Resuscitate the discussion on ML -> [DISCUSS] Binary Values in Key
value pairs

  - It is technically a breaking change and older clients expect the
previous format.


GeoData JSON vs Binary

  - Binary doesn’t seem very hard in the majority of cases

  - base64 might add unnecessary overhead
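The base64 overhead noted above is easy to quantify: the encoding emits 4
output characters for every 3 input bytes, a fixed ~33% inflation before any
compression. A quick stdlib sketch (the payload is arbitrary example data):

```python
import base64

payload = bytes(range(256)) * 12     # 3072 bytes of arbitrary binary data
encoded = base64.b64encode(payload)  # 4 output chars per 3 input bytes

# 3072 / 3 * 4 = 4096: a fixed ~33% size inflation.
assert len(encoded) == 4096
```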


FlightSQL

  - Lack of documentation, not referenced from the website

  - Probably needing another community developer acting as champion for the
FlightSQL, David Li has been overseeing efforts, but the current
contributions seem more focused on internal usage.

On Wed, Mar 2, 2022 at 1:03 PM Ian Cook  wrote:

> Hi all,
>
> Our biweekly sync call is today at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian
>


Re: [Discuss] Best practice for storing key-value metadata for Extension Types

2022-02-10 Thread Alessandro Molina
Mentioned this already to Joris, but want to make sure we don't miss it.

C-Data, and thus ARROW:extension:metadata, was mostly designed for shipping
data between different processes within the same host.
But we may start using the spec for further purposes, including saving it to
files that could be read across different architectures.
ARROW:extension:metadata doesn't in any way specify the endianness of fields
like int32 num_items, int32 name_len, etc., and that's something we must do
if we plan to ship that data through files or networks.

JSON would definitely get rid of the endianness problem, at the cost of
greater size and a more complex parser. But there are super-minimal JSON
parsers designed specifically for embedding, like jsmn (
https://github.com/zserge/jsmn ).
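For concreteness, the length-prefixed layout being discussed (int32
num_items, then per item an int32-prefixed name and an int32-prefixed value)
can be sketched in a few lines of Python. This is a minimal illustration,
not a reference implementation: the spec uses the platform's native byte
order, while the sketch pins little-endian purely so it is deterministic,
and the geodesic/crs pairs are just example data borrowed from the thread.

```python
import struct

def encode_metadata(pairs, byte_order="<"):
    # int32 num_items, then per item: int32 name_len, name bytes,
    # int32 value_len, value bytes. Byte order is pinned here only
    # to make the example deterministic.
    i32 = byte_order + "i"
    out = [struct.pack(i32, len(pairs))]
    for name, value in pairs:
        out.append(struct.pack(i32, len(name)) + name)
        out.append(struct.pack(i32, len(value)) + value)
    return b"".join(out)

def decode_metadata(buf, byte_order="<"):
    i32 = struct.Struct(byte_order + "i")
    (num_items,) = i32.unpack_from(buf, 0)
    pos = 4
    pairs = []
    for _ in range(num_items):
        (name_len,) = i32.unpack_from(buf, pos); pos += 4
        name = buf[pos:pos + name_len]; pos += name_len
        (value_len,) = i32.unpack_from(buf, pos); pos += 4
        value = buf[pos:pos + value_len]; pos += value_len
        pairs.append((name, value))
    return pairs

pairs = [(b"geodesic", b"true"), (b"crs", b"EPSG:4326")]
encoded = encode_metadata(pairs)
assert decode_metadata(encoded) == pairs
# Decoding the same buffer with the opposite byte order would read a
# bogus num_items -- exactly the portability hazard raised above.
```

The roundtrip works on one machine precisely because both sides agree on a
byte order; nothing in the encoded bytes records that choice.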

On Wed, Feb 9, 2022 at 2:51 AM Dewey Dunnington 
wrote:

> I'll share a bit more about geospatial extension types that Joris
> mentioned. I'm new to the Arrow community and didn't know that there were
> any restrictions on metadata values (the C Data interface docs don't seem
> to indicate that there are restrictions, or if it's there I missed it!), so
> I used the same encoding for the ARROW:extension:metadata that's used to
> encode the parent metadata (int32 num_items, int32 name_len,
> char[name_len], int32 value_len, char[value_len],  etc..). I did this
> because I needed two key/value pairs (geodesic = true/false; crs =
> some_coordinate_reference_system) and already had the code to iterate over
> the parent metadata. I'm not saying that it's any pinnacle of elegant code
> (still very much a prototype), but it only takes about 30 lines of C to do
> this [1].
>
> I prototyped the extension types for geospatial using the C data interface,
> the idea being that a header-only helper file (geoarrow.hpp) could be
> distributed that would make it an attractive and easy alternative to
> well-known binary (WKB) to pass geometries around between libraries (e.g.,
> GEOS, GDAL, PROJ). Requiring anybody who uses an extension type to also
> vendor a JSON parser [2] seems a bit anti-social and restricts where that
> extension type is useful, although I understand that it's not the use case
> that many might have.
>
> There are definitely reasonable ways to do what I'm trying to do without
> resorting to a binary encoding, and JSON could probably even work...I'm
> just trying to share the use-case since it seems like this kind of
> environment isn't how folks envisioned extension types being used.
>
> [1]
>
> https://github.com/paleolimbot/geoarrow/blob/master/src/internal/geoarrow.hpp#L511-L542
> [2] The commonly vendored JSON parser in geospatial libraries is this one:
> https://github.com/nlohmann/json
>
> On Tue, Feb 8, 2022 at 7:58 PM Weston Pace  wrote:
>
> > I think I'm +0 but lean slightly towards JSON.
> >
> > In favor of binary I would guess that most extension types are going
> > to have relatively simple parameterization (to the point that
> > protobuf/flatbuffers isn't really needed).  For example, the Substrait
> > consumer PR has five extension types at the moment (e.g. uuid,
> > varchar) and only two of them are parameterized and each of these by a
> > single int32_t.  It might be interesting to see what kinds of
> > extension types the geospatial community uses.
> >
> > That being said, this sort of parsing isn't really on any kind of
> > critical path.  It's very likely that users (not Arrow developers)
> > will be creating and working with extension types.  These users are
> > likely going to default to JSON (or pickle or XML).  If our "well
> > known types" use JSON then it will be more easily recognizable to
> > users what is going on.
> >
> > -Weston
> >
> > On Tue, Feb 8, 2022 at 8:14 AM Joris Van den Bossche
> >  wrote:
> > >
> > > On Tue, 8 Feb 2022 at 17:37, Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com>
> > > wrote:
> > >
> > > > ...
> > > >
> > > > Wrt to binary, imo the challenge is:
> > > > * we state that backward incompatible changes to the c data interface
> > > > require a new spec [1]
> > > >
> > >
> > > Note that this discussion wouldn't change anything about the C Data
> > > Interface spec itself. The discussion is only about the *value* that is
> > put
> > > in one of the key-value metadata fields. The C Data Interface spec
> > defines
> > > how the metadata needs to be stored, but doesn't specify anything about
> > the
> > > actual value of one of the key-value metadata fields.
> > >
> > >
> > > > * we state that the metadata is a binary string [2]
> > > > * a valid string is a subset of all valid byte arrays and thus
> > removing "
> > > > *string*" from the spec is backward incompatible
> > > >
> > > > If we write invalid utf8 to it and a reader assumes utf8 when reading
> > it,
> > > > we trigger undefined behavior.
> > > >
> > > > I was a bit surprised by ARROW-15613 - my understanding is that the
> c++
> > > implementation is not following the spec, and if we at arrow2 were not
> > > > checking for utf8, 

Re: Release 7.0.0 Retrospective

2022-02-02 Thread Alessandro Molina
For anyone interested this is the document that resulted from the Release
Retrospective

https://docs.google.com/document/d/1xnCWpEqznzcMu3meWUk4SZnFlEiw1dRfh249T1dMXfg/edit#



On Tue, Feb 1, 2022 at 11:30 PM Ian Joiner  wrote:

> Could you please share the retro board so that we can all comment on the
> issues?
>
> I’m very sorry for at least two of the redos were actually related to ORC
> with one of them caused by a simple import error I didn’t find in my Python
> docker tests which is caused by the fact that my old way to run PyArrow dev
> locally somehow broke around the 5.0.0 release. To prevent similar issues
> from happening I will keep asking questions here until my local env
> problems get fully resolved and both local tests and release verification
> work for the languages I develop in.
>
> I will try to attend the usual Biweekly meeting tomorrow.
>
> Ian
>
> > On Feb 1, 2022, at 9:23 AM, Alessandro Molina <
> alessan...@ursacomputing.com> wrote:
> >
> > For anyone interested in the topic, I got some feedback that suggests it
> > might be more effective to have a meeting dedicated to the topic with the
> > people who have been involved in preparing the release and subsequently
> > share the outcome of that meeting with everyone at the Arrow biweekly
> > meeting.
> >
> > Thus if you are interested in hearing about the release topic, feel free
> to
> > just join the usual Arrow Biweekly Meeting. The "post-mortem" will be
> > shared and discussed at that meeting too. The meeting that I shared will
> be
> > focused on producing the post-mortem together with those who have been
> > involved in preparing release 7.0.0 itself so that it can then be
> discussed
> > at the biweekly.
> >
> > On Tue, Feb 1, 2022 at 11:20 AM Alessandro Molina <
> > alessan...@ursacomputing.com> wrote:
> >
> >> Given the unexpected amount of tries we had to go through to publish
> >> version 7 (I don't think there were past cases where RC10 was reached),
> it
> >> would be helpful to go through what happened, what didn't work and what
> we
> >> can do to prevent it from happening again in the future.
> >>
> >> I created a meeting for tomorrow at 16:30 CET (15:30 UTC) that anyone is
> >> welcome to join to follow or participate in the discussion. The meeting
> >> link is at https://meet.google.com/jzg-gzks-qas
> >>
> >> After the meeting is completed I'll share a Google Docs with the notes
> of
> >> the meeting to make everyone aware of the outcome and what solutions
> have
> >> been proposed.
> >>
> >> Bests,
> >> Alessandro
> >>
>
>


Re: Release 7.0.0 Retrospective

2022-02-01 Thread Alessandro Molina
For anyone interested in the topic, I got some feedback that suggests it
might be more effective to have a meeting dedicated to the topic with the
people who have been involved in preparing the release and subsequently
share the outcome of that meeting with everyone at the Arrow biweekly
meeting.

Thus if you are interested in hearing about the release topic, feel free to
just join the usual Arrow Biweekly Meeting. The "post-mortem" will be
shared and discussed at that meeting too. The meeting that I shared will be
focused on producing the post-mortem together with those who have been
involved in preparing release 7.0.0 itself so that it can then be discussed
at the biweekly.

On Tue, Feb 1, 2022 at 11:20 AM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> Given the unexpected amount of tries we had to go through to publish
> version 7 (I don't think there were past cases where RC10 was reached), it
> would be helpful to go through what happened, what didn't work and what we
> can do to prevent it from happening again in the future.
>
> I created a meeting for tomorrow at 16:30 CET (15:30 UTC) that anyone is
> welcome to join to follow or participate in the discussion. The meeting
> link is at https://meet.google.com/jzg-gzks-qas
>
> After the meeting is completed I'll share a Google Docs with the notes of
> the meeting to make everyone aware of the outcome and what solutions have
> been proposed.
>
> Bests,
> Alessandro
>


Re: Managing usage of the @ApacheArrow Twitter handle and other social media

2022-02-01 Thread Alessandro Molina
I have never used https://github.com/gr2m/twitter-together; in the past I
used Hootsuite to set up approval workflows, but I think the idea of setting
up a workflow through GitHub PRs is a good one. It would be able to leverage
committer/PMC membership to merge the PRs and would allow anyone to
contribute social media content.

On Tue, Feb 1, 2022 at 12:43 AM QP Hou  wrote:

> I don't know how other projects manage this, but one solution we could
> evaluate is using github PRs to manage the twitter account. For
> example, here is a github action that does exactly this
> https://github.com/gr2m/twitter-together.
>
> On Mon, Jan 31, 2022 at 3:14 PM Wes McKinney  wrote:
> >
> > hi all,
> >
> > The project is approaching its 6th birthday and we have come a long way!
> >
> > We have a relatively seldom-used Twitter handle
> > twitter.com/ApacheArrow and only a handful of people in the community
> > have access to it. I know that Jacques and I do, but I am not sure who
> > else.
> >
> > I wanted to discuss a few things:
> >
> > * Giving more committers/PMC members access to the Twitter handle — I
> > think clearly there should be more people with access (I tweet through
> > TweetDeck, e.g. I just posted about a newly posted blog post)
> > * Consider if there are any other social media channels where we might
> > want to promote Arrow content
> > * Discuss a social media policy more broadly for the project
> >
> > On the latter point, my feelings are:
> >
> > * Promote content and usage of Apache Arrow, but not companies or
> > products (Apache projects are independent)
> > * Provide a way for the community to submit ideas/materials for social
> media
> >
> > Does anyone know if other ASF projects have policies/conventions about
> > how they decide how to use their social media properties to best serve
> > the community?
> >
> > Thanks,
> > Wes
>


Release 7.0.0 Retrospective

2022-02-01 Thread Alessandro Molina
Given the unexpected number of attempts we had to go through to publish
version 7 (I don't think there were past cases where RC10 was reached), it
would be helpful to go through what happened, what didn't work and what we
can do to prevent it from happening again in the future.

I created a meeting for tomorrow at 16:30 CET (15:30 UTC) that anyone is
welcome to join to follow or participate in the discussion. The meeting
link is at https://meet.google.com/jzg-gzks-qas

After the meeting is completed I'll share a Google Docs with the notes of
the meeting to make everyone aware of the outcome and what solutions have
been proposed.

Bests,
Alessandro


Re: Preparing for version 7.0.0 release

2022-01-13 Thread Alessandro Molina
The skeleton for the Release blog post has been created at
https://github.com/apache/arrow-site/pull/178/files

If anyone wants to prepare the part related to the environment/bindings
they work on it would greatly help. I'll do my best to make sure R, C++,
Python and Java parts are filled.

On Tue, Jan 4, 2022 at 3:27 PM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> Quick note that all "Unassigned" issues that were not already started have
> been moved to 8.0.0.
> End of next week I'll do another pass and move all "Improvements/New
> Features" that are not yet started to 8.0.0
>
> On Tue, Jan 4, 2022 at 10:02 AM Antoine Pitrou  wrote:
>
>>
>> Le 03/01/2022 à 15:44, Alessandro Molina a écrit :
>> > The plan seems to be to cut a release the 2nd or 3rd week of January, a
>> new
>> > confluence page was made to track progress of the release (
>> > https://cwiki.apache.org/confluence/display/ARROW/Arrow+7.0.0+Release
>> ).
>> >
>> > It would greatly help in the process of preparing for the release if you
>> > could review tickets that are assigned to you in the "TODO Backlog" and
>> > move those that you think you will not be able to close in ~1 week to
>> > "Version 8.0.0" in Jira, so that we can start preparing release
>> > announcements etc with a good estimate of what's actually going to end
>> up
>> > in the release.
>>
>> Note there's also the cpp-7.0.0 version on the Parquet JIRA:
>> https://issues.apache.org/jira/projects/PARQUET/versions/12350844
>>
>> Regards
>>
>> Antoine.
>>
>


Re: [RUST] Preparing for 7.0.0 release

2022-01-13 Thread Alessandro Molina
Hi Andrew, just wanted to update you on the fact that the skeleton for
v7.0.0 blog post has been created, so you can freely make changes in that
PR.

https://github.com/apache/arrow-site/pull/178/files

On Fri, Jan 7, 2022 at 12:20 AM Andrew Lamb  wrote:

> Greetings, fellow Rustaceans, and happy New Year!
>
> I am looking for feedback over the next day or so on:
> 1. A PR[1] with a CHANGELOG for arrow rs 7.0.0.
> 2. A PR[2] with some updates to the readme to clarify versioning
>
> Please take a look if you are interested, or let me know if you need more
> time to review
>
> I think all other outstanding PRs have been merged -- let us know if you
> have something you want merged for the release.
>
> Also, could someone please create a blog post for 7.0.0 (following the
> model for 6.0.0 [3])? It would be neat to highlight some of the new
> features in the 7.0.0 line, and I have some content to add about upcoming
> release cadence and other planned work.
>
> I plan to create a release candidate sometime over this weekend, so if
> voting goes well that means we'll be able to release to crates.io sometime
> around Tuesday the 11th of January.
>
> Thank you
> Andrew
>
> [1]
> https://github.com/apache/arrow-rs/pull/1141#pullrequestreview-846152425
> [2] https://github.com/apache/arrow-rs/pull/1142
> [3] https://github.com/apache/arrow-site/pull/156
>


Re: Preparing for version 7.0.0 release

2022-01-04 Thread Alessandro Molina
Quick note that all "Unassigned" issues that were not already started have
been moved to 8.0.0.
End of next week I'll do another pass and move all "Improvements/New
Features" that are not yet started to 8.0.0

On Tue, Jan 4, 2022 at 10:02 AM Antoine Pitrou  wrote:

>
> Le 03/01/2022 à 15:44, Alessandro Molina a écrit :
> > The plan seems to be to cut a release the 2nd or 3rd week of January, a
> new
> > confluence page was made to track progress of the release (
> > https://cwiki.apache.org/confluence/display/ARROW/Arrow+7.0.0+Release ).
> >
> > It would greatly help in the process of preparing for the release if you
> > could review tickets that are assigned to you in the "TODO Backlog" and
> > move those that you think you will not be able to close in ~1 week to
> > "Version 8.0.0" in Jira, so that we can start preparing release
> > announcements etc with a good estimate of what's actually going to end up
> > in the release.
>
> Note there's also the cpp-7.0.0 version on the Parquet JIRA:
> https://issues.apache.org/jira/projects/PARQUET/versions/12350844
>
> Regards
>
> Antoine.
>


Preparing for version 7.0.0 release

2022-01-03 Thread Alessandro Molina
The plan seems to be to cut a release the 2nd or 3rd week of January, a new
confluence page was made to track progress of the release (
https://cwiki.apache.org/confluence/display/ARROW/Arrow+7.0.0+Release ).

It would greatly help in the process of preparing for the release if you
could review tickets that are assigned to you in the "TODO Backlog" and
move those that you think you will not be able to close in ~1 week to
"Version 8.0.0" in Jira, so that we can start preparing release
announcements etc with a good estimate of what's actually going to end up
in the release.

Thanks everybody for the great work! Lots of great things are coming in
7.0.0.


Re: [VOTE] Release Apache Arrow 6.0.1 - RC1

2021-11-25 Thread Alessandro Molina
For anyone willing to give a final check and merge the PR (
https://github.com/apache/arrow-site/pull/165/files ), I think that the
blog post is good to go and hasn't had any new changes in a few days.

On Fri, Nov 19, 2021 at 1:35 PM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> For anyone interested I created the skeleton for the announcement blog
> post at https://github.com/apache/arrow-site/pull/165/files
>
> As it's a fairly small release I'll try to capture the major changes, but
> feel free to add or edit the blog post as you see fit through the usual
> commit suggestions
>
> On Thu, Nov 11, 2021 at 3:39 AM Sutou Kouhei  wrote:
>
>> Hi,
>>
>> I would like to propose the following release candidate (RC1) of Apache
>> Arrow version 6.0.1. This is a release consisting of 29
>> resolved JIRA issues[1].
>>
>> This release candidate is based on commit:
>> 347a88ff9d20e2a4061eec0b455b8ea1aa8335dc [2]
>>
>> The source release rc1 is hosted at [3].
>> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
>> The changelog is located at [12].
>>
>> Please download, verify checksums and signatures, run the unit tests,
>> and vote on the release. See [13] for how to validate a release candidate.
>>
>> See also verification results by GitHub Actions:
>>
>>   https://github.com/apache/arrow/pull/11671
>>
>> There are some known failures:
>>
>>   * verify-rc-source-integration-linux-amd64
>>   * verify-rc-source-python-macos-arm64
>>   * verify-rc-wheels-macos-11-amd64
>>   * verify-rc-wheels-macos-11-arm64
>>
>> Except for verify-rc-source-integration-linux-amd64, they
>> also failed with 6.0.0 RC3:
>>
>>   https://github.com/apache/arrow/pull/11511
>>
>> Here is the verify-rc-source-integration-linux-amd64 log:
>>
>>
>> https://github.com/ursacomputing/crossbow/runs/4172486523?check_suite_focus=true
>>
>> I'm not sure whether this is a blocker or not.
>>
>> Note that the verification passed on my local machine.
>>
>>
>> The vote will be open for at least 72 hours.
>>
>> [ ] +1 Release this as Apache Arrow 6.0.1
>> [ ] +0
>> [ ] -1 Do not release this as Apache Arrow 6.0.1 because...
>>
>> [1]:
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%206.0.1
>> [2]:
>> https://github.com/apache/arrow/tree/347a88ff9d20e2a4061eec0b455b8ea1aa8335dc
>> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-6.0.1-rc1
>> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
>> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
>> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
>> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
>> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/6.0.1-rc1
>> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/6.0.1-rc1
>> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/6.0.1-rc1
>> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
>> [12]:
>> https://github.com/apache/arrow/blob/347a88ff9d20e2a4061eec0b455b8ea1aa8335dc/CHANGELOG.md
>> [13]:
>> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>>
>


Re: [VOTE] Release Apache Arrow 6.0.1 - RC1

2021-11-19 Thread Alessandro Molina
For anyone interested I created the skeleton for the announcement blog post
at https://github.com/apache/arrow-site/pull/165/files

As it's a fairly small release I'll try to capture the major changes, but
feel free to add or edit the blog post as you see fit through the usual
commit suggestions

On Thu, Nov 11, 2021 at 3:39 AM Sutou Kouhei  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC1) of Apache
> Arrow version 6.0.1. This is a release consisting of 29
> resolved JIRA issues[1].
>
> This release candidate is based on commit:
> 347a88ff9d20e2a4061eec0b455b8ea1aa8335dc [2]
>
> The source release rc1 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> The changelog is located at [12].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [13] for how to validate a release candidate.
>
> See also verification results by GitHub Actions:
>
>   https://github.com/apache/arrow/pull/11671
>
> There are some known failures:
>
>   * verify-rc-source-integration-linux-amd64
>   * verify-rc-source-python-macos-arm64
>   * verify-rc-wheels-macos-11-amd64
>   * verify-rc-wheels-macos-11-arm64
>
> Except for verify-rc-source-integration-linux-amd64, they
> also failed with 6.0.0 RC3:
>
>   https://github.com/apache/arrow/pull/11511
>
> Here is the verify-rc-source-integration-linux-amd64 log:
>
>
> https://github.com/ursacomputing/crossbow/runs/4172486523?check_suite_focus=true
>
> I'm not sure whether this is a blocker or not.
>
> Note that the verification passed on my local machine.
>
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 6.0.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 6.0.1 because...
>
> [1]:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%206.0.1
> [2]:
> https://github.com/apache/arrow/tree/347a88ff9d20e2a4061eec0b455b8ea1aa8335dc
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-6.0.1-rc1
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/6.0.1-rc1
> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/6.0.1-rc1
> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/6.0.1-rc1
> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [12]:
> https://github.com/apache/arrow/blob/347a88ff9d20e2a4061eec0b455b8ea1aa8335dc/CHANGELOG.md
> [13]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>


Re: Question about Arrow Mutable/Immutable Arrays choice

2021-11-04 Thread Alessandro Molina
On Wed, Nov 3, 2021 at 11:34 PM Jacques Nadeau  wrote:


> In a perfect world we would have done a better job in the object
> hierarchy/behavior of making this explicit but we don't live in that world,
> unfortunately.


Makes sense, but I thought that was exactly the reason why set/setSafe are
only available for fixed-width vectors.
On those, once the size is set, it seems fairly safe to mutate them,
provided the set methods take care of updating null values too.

So, more generally, my question was whether we should grow mutation
functions for fixed-size arrays in C++ and the other bindings too, or
whether we should remove the mutation features from the Java API and have
people deal with buffers directly if they want to mutate things (making it
more explicit that you are messing with internals), so that we have a
consistent experience across bindings.


Question about Arrow Mutable/Immutable Arrays choice

2021-11-03 Thread Alessandro Molina
I recently noticed that in the Java implementation we expose set/setSafe
functions that allow mutating Arrow Arrays [1].

This seems to be at odds with the general design of the C++ (and by
consequence Python and R) library where Arrays are immutable and can be
modified only through compute functions returning copies.

The Arrow Format documentation [2] seems to suggest that mutation of data
structures is possible and left as an implementation detail, but given that
some users might be willing to mutate existing structures (for example to
avoid incurring the memory cost of copies when dealing with big arrays)
I think there might be reasons for both allowing mutation of Arrays and
disallowing it. It probably makes sense to ensure that all the
implementations agree on such a fundamental choice to avoid setting
expectations on users' side that might not apply when they cross language
barriers.

[1]
https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/SmallIntVector.html#setSafe-int-int-
[2] https://arrow.apache.org/docs/format/Columnar.html


Re: [VOTE] Release Apache Arrow 6.0.0 - RC3

2021-10-22 Thread Alessandro Molina
+1 (non binding)

Verified on Mac OS 10.14 x86

Checked
dev/release/verify-release-candidate.sh binaries 6.0.0 3
dev/release/verify-release-candidate.sh wheels 6.0.0 3

One note: I initially got an "OSError: [Errno 24] Too many open files"
error and had to raise the open files limit. I don't know if that's
expected or if something changed recently.

On Fri, Oct 22, 2021 at 1:31 AM Krisztián Szűcs 
wrote:

> Hi,
>
> I would like to propose the following release candidate (RC3) of Apache
> Arrow version 6.0.0. This is a release consisting of 592
> resolved JIRA issues[1].
>
> This release candidate is based on commit:
> 5a5f4ce326194750422ef6f053469ed1912ce69f [2]
>
> The source release rc3 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8][9].
> The changelog is located at [10].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [11] for how to validate a release candidate.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 6.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 6.0.0 because...
>
> [1]:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%206.0.0
> [2]:
> https://github.com/apache/arrow/tree/5a5f4ce326194750422ef6f053469ed1912ce69f
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-6.0.0-rc3
> [4]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [7]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/6.0.0-rc3
> [8]: https://apache.jfrog.io/artifactory/arrow/python-rc/6.0.0-rc3
> [9]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [10]:
> https://github.com/apache/arrow/blob/5a5f4ce326194750422ef6f053469ed1912ce69f/CHANGELOG.md
> [11]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>


Re: Preparing for release 6.0.0

2021-10-14 Thread Alessandro Molina
FYI, I also created the skeleton for the release blog post at
https://github.com/apache/arrow-site/pull/153/files
This should give everyone time to fill in the relevant parts of the blog post
using GitHub commit suggestions.

Ignore the blog post date; I just put a random future date, which will
have to be updated when we actually publish the release.

On Thu, Oct 14, 2021 at 10:24 AM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> Seems the tentative release date will probably slip to Monday/Tuesday next
> week. There has been some delay generated by the release of Python3.10 in
> the hope we could already include support for it in the current release.
>
> There are a total of 30+ started issues at the moment; it would be great
> if the owners could defer to v7.0.0 those that they don't think they can
> close in time for Monday
>
> On Mon, Oct 4, 2021 at 1:38 PM Krisztián Szűcs 
> wrote:
>
>> Aiming the first release candidate for Oct 14th/15th sounds good to me.
>>
>> On Mon, Oct 4, 2021 at 10:35 AM Alessandro Molina
>>  wrote:
>> >
>> > If possible I think it makes sense to aim for the week of 11th,
>> > if there is any blocker or major issues that gets raised and for which
>> we
>> > need to wait we can defer to the week of 18th.
>> >
>> > As far as it's 2nd/3rd week of the month I don't think it makes any
>> major
>> > difference.
>> > Obviously when Krisztian will be able to cut the release plays the
>> biggest
>> > role here, so I'm willing to hear his opinion about what we can aim for.
>> >
>> > On Mon, Oct 4, 2021 at 3:48 AM Micah Kornfield 
>> > wrote:
>> >
>> > > Hi
> > I just wanted to clarify: are we aiming at cutting the first release
> > candidate for the main repo the week of the 11th or the 18th?
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > > On Sat, Oct 2, 2021 at 3:27 AM Andrew Lamb 
>> wrote:
>> > >
>> > > > FYI as we did with the arrow-rs 5.0 release, I will prepare an
>> arrow-rs
>> > > > 6.0.0 release approximately concurrently with the other languages.
>> > > >
> > > I will tentatively aim to create an arrow-rs 6.0 candidate on
>> October 14
>> > > > or October 15 (assuming it is approved, it would be released on or
>> around
>> > > > October 18, 2021).
>> > > >
>> > > > Please let me know if there are any concerns with this schedule
>> > > > Andrew
>> > > >
>> > > > On Fri, Oct 1, 2021 at 3:34 AM Alessandro Molina <
>> > > > alessan...@ursacomputing.com> wrote:
>> > > >
>> > > > > In preparation for release 6.0.0 which should probably happen
>> within
>> > > the
>> > > > > next 2-3 weeks according to the usual release cycle the
>> Confluence page
>> > > > for
>> > > > > the release has been created (
>> > > > >
>> https://cwiki.apache.org/confluence/display/ARROW/Arrow+6.0.0+Release
>> > > )
>> > > > >
>> > > > > Also all non Bug issues that were not started have been moved to
>> > > version
>> > > > > 7.0.0, please do not create new issues under 6.0.0 unless you
>> expect to
>> > > > be
>> > > > > able to solve them within the next few days. Likewise, feel free
>> to
>> > > move
>> > > > > back to version 6.0.0 things that you expect to be able to address
>> > > within
>> > > > > the next few days.
>> > > > >
>> > > > > That leaves the release with a total of ~60 bugs pending. It's
>> probably
>> > > > > unrealistic that those can be fixed in 2-3 weeks, so I encourage
>> the
>> > > > owners
>> > > > > of those issues to go through them and postpone them to version
>> 7.0.0
>> > > > > unless they plan to address them soon. The remaining ones will be
>> moved
>> > > > to
>> > > > > version 7.0.0 before the release.
>> > > > >
>> > > > > Thanks everybody for the big effort in contributing to a release
>> that
>> > > > > includes nearly 500 issues and if you are aware of any blocker
>> (apart
>> > > > those
>> > > > > already marked as blockers in the Confluence page) please raise
>> them
>> > > > early
>> > > > > on so that there is enough time to address them.
>> > > > >
>> > > >
>> > >
>>
>


Re: Preparing for release 6.0.0

2021-10-14 Thread Alessandro Molina
Seems the tentative release date will probably slip to Monday/Tuesday next
week. There has been some delay caused by the release of Python 3.10, in
the hope that we could already include support for it in the current release.

There are a total of 30+ started issues at the moment; it would be great if
the owners could defer to v7.0.0 those that they don't think they can close in
time for Monday.

On Mon, Oct 4, 2021 at 1:38 PM Krisztián Szűcs 
wrote:

> Aiming the first release candidate for Oct 14th/15th sounds good to me.
>
> On Mon, Oct 4, 2021 at 10:35 AM Alessandro Molina
>  wrote:
> >
> > If possible I think it makes sense to aim for the week of 11th,
> > if there is any blocker or major issues that gets raised and for which we
> > need to wait we can defer to the week of 18th.
> >
> > As far as it's 2nd/3rd week of the month I don't think it makes any major
> > difference.
> > Obviously when Krisztian will be able to cut the release plays the
> biggest
> > role here, so I'm willing to hear his opinion about what we can aim for.
> >
> > On Mon, Oct 4, 2021 at 3:48 AM Micah Kornfield 
> > wrote:
> >
> > > Hi
> > > I just wanted to clarify: are we aiming at cutting the first release
> > > candidate for the main repo the week of the 11th or the 18th?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Sat, Oct 2, 2021 at 3:27 AM Andrew Lamb 
> wrote:
> > >
> > > > FYI as we did with the arrow-rs 5.0 release, I will prepare an
> arrow-rs
> > > > 6.0.0 release approximately concurrently with the other languages.
> > > >
> > > > I will tentatively aim to create an arrow-rs 6.0 candidate on
> October 14
> > > > or October 15 (assuming it is approved, it would be released on or
> around
> > > > October 18, 2021).
> > > >
> > > > Please let me know if there are any concerns with this schedule
> > > > Andrew
> > > >
> > > > On Fri, Oct 1, 2021 at 3:34 AM Alessandro Molina <
> > > > alessan...@ursacomputing.com> wrote:
> > > >
> > > > > In preparation for release 6.0.0 which should probably happen
> within
> > > the
> > > > > next 2-3 weeks according to the usual release cycle the Confluence
> page
> > > > for
> > > > > the release has been created (
> > > > >
> https://cwiki.apache.org/confluence/display/ARROW/Arrow+6.0.0+Release
> > > )
> > > > >
> > > > > Also all non Bug issues that were not started have been moved to
> > > version
> > > > > 7.0.0, please do not create new issues under 6.0.0 unless you
> expect to
> > > > be
> > > > > able to solve them within the next few days. Likewise, feel free to
> > > move
> > > > > back to version 6.0.0 things that you expect to be able to address
> > > within
> > > > > the next few days.
> > > > >
> > > > > That leaves the release with a total of ~60 bugs pending. It's
> probably
> > > > > unrealistic that those can be fixed in 2-3 weeks, so I encourage
> the
> > > > owners
> > > > > of those issues to go through them and postpone them to version
> 7.0.0
> > > > > unless they plan to address them soon. The remaining ones will be
> moved
> > > > to
> > > > > version 7.0.0 before the release.
> > > > >
> > > > > Thanks everybody for the big effort in contributing to a release
> that
> > > > > includes nearly 500 issues and if you are aware of any blocker
> (apart
> > > > those
> > > > > already marked as blockers in the Confluence page) please raise
> them
> > > > early
> > > > > on so that there is enough time to address them.
> > > > >
> > > >
> > >
>


Re: Preparing for release 6.0.0

2021-10-04 Thread Alessandro Molina
If possible, I think it makes sense to aim for the week of the 11th;
if any blockers or major issues are raised that we need to wait for, we can
defer to the week of the 18th.

As long as it's the 2nd/3rd week of the month, I don't think it makes any
major difference.
Obviously, when Krisztián is able to cut the release plays the biggest
role here, so I'd like to hear his opinion about what we can aim for.

On Mon, Oct 4, 2021 at 3:48 AM Micah Kornfield 
wrote:

> Hi
> I just wanted to clarify: are we aiming at cutting the first release
> candidate for the main repo the week of the 11th or the 18th?
>
> Thanks,
> Micah
>
> On Sat, Oct 2, 2021 at 3:27 AM Andrew Lamb  wrote:
>
> > FYI as we did with the arrow-rs 5.0 release, I will prepare an arrow-rs
> > 6.0.0 release approximately concurrently with the other languages.
> >
> > I will tentatively aim to create an arrow-rs 6.0 candidate on October 14
> > or October 15 (assuming it is approved, it would be released on or around
> > October 18, 2021).
> >
> > Please let me know if there are any concerns with this schedule
> > Andrew
> >
> > On Fri, Oct 1, 2021 at 3:34 AM Alessandro Molina <
> > alessan...@ursacomputing.com> wrote:
> >
> > > In preparation for release 6.0.0 which should probably happen within
> the
> > > next 2-3 weeks according to the usual release cycle the Confluence page
> > for
> > > the release has been created (
> > > https://cwiki.apache.org/confluence/display/ARROW/Arrow+6.0.0+Release
> )
> > >
> > > Also all non Bug issues that were not started have been moved to
> version
> > > 7.0.0, please do not create new issues under 6.0.0 unless you expect to
> > be
> > > able to solve them within the next few days. Likewise, feel free to
> move
> > > back to version 6.0.0 things that you expect to be able to address
> within
> > > the next few days.
> > >
> > > That leaves the release with a total of ~60 bugs pending. It's probably
> > > unrealistic that those can be fixed in 2-3 weeks, so I encourage the
> > owners
> > > of those issues to go through them and postpone them to version 7.0.0
> > > unless they plan to address them soon. The remaining ones will be moved
> > to
> > > version 7.0.0 before the release.
> > >
> > > Thanks everybody for the big effort in contributing to a release that
> > > includes nearly 500 issues and if you are aware of any blocker (apart
> > those
> > > already marked as blockers in the Confluence page) please raise them
> > early
> > > on so that there is enough time to address them.
> > >
> >
>


Preparing for release 6.0.0

2021-10-01 Thread Alessandro Molina
In preparation for release 6.0.0, which should probably happen within the
next 2-3 weeks according to the usual release cycle, the Confluence page for
the release has been created (
https://cwiki.apache.org/confluence/display/ARROW/Arrow+6.0.0+Release )

Also, all non-bug issues that were not started have been moved to version
7.0.0. Please do not create new issues under 6.0.0 unless you expect to be
able to solve them within the next few days. Likewise, feel free to move
back to version 6.0.0 anything you expect to be able to address within
the next few days.

That leaves the release with a total of ~60 bugs pending. It's probably
unrealistic that those can all be fixed in 2-3 weeks, so I encourage the owners
of those issues to go through them and postpone them to version 7.0.0
unless they plan to address them soon. The remaining ones will be moved to
version 7.0.0 before the release.

Thanks everybody for the big effort in contributing to a release that
includes nearly 500 issues. If you are aware of any blockers (apart from those
already marked as blockers on the Confluence page), please raise them early
on so that there is enough time to address them.


Re: [DISCUSS][Python] Public Cython API

2021-08-25 Thread Alessandro Molina
Given that we didn't get many opinions on this one, I propose we move
forward with merging the open PR that moves the ipc Cython implementation,
and find out whether any issues get reported because projects out there were
relying on it.
It seems that ipc is a low-risk module from that point of view, and moving it
will at least reduce the surface of `pyarrow.lib`, making it easier to reason
about what should be public or internal in the future.

If we get users complaining that they were using ipc from Cython, we can
think about how to expose it properly instead of exposing it by chance as a
side effect of using includes in Cython.

On Fri, Aug 20, 2021 at 12:24 PM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> While working on https://github.com/apache/arrow/pull/10162 it was raised
> the concern that it's hard to change Cython code because it might break
> third party libraries and projects relying on pyarrow through Cython.
>
> Mostly the problem comes from the fact that the documentation suggests
> pyarrow.lib.* (
> https://arrow.apache.org/docs/python/extending.html#example ) as what
> should be used to import features from pyarrow in Cython.
> Given most of pyarrow is implemented including pxi files into the lib.pyx
> module (
> https://github.com/apache/arrow/blob/master/python/pyarrow/lib.pyx#L118-L163
> ) it means that we are exposing the majority of the internals as our public
> api.
>
> The consequence is that we in practice are preventing ourselves from
> touching anything that exists in those included files as they might have
> been used by another project and thus they can't be moved or change their
> signature.
>
> We could argue that only what was documented explicitly should be
> considered "public" and everything else can be changed, but our
> documentation seems to be unclear on this point. It lists some functions
> that should be considered our explicit api (
> https://arrow.apache.org/docs/python/extending.html#cython-api ) but then
> uses CArray  in the example (
> https://arrow.apache.org/docs/python/extending.html#example ) which
> wasn't listed as public.
>
> I think it would be helpful to come to an agreement about what we should
> consider publicly exposed from Cython so that we can properly update
> documentation and unblock possible refactoring.
>
> Personally, even at risk of breaking third parties code, I think it would
> be wise to aim for the minimum exposed surface. I'd consider Cython mostly
> an implementation detail and promote usage of libarrow from C/C++ directly
> if you need to work on high performance Python extensions.
>


[DISCUSS][Python] Public Cython API

2021-08-20 Thread Alessandro Molina
While working on https://github.com/apache/arrow/pull/10162 the concern was
raised that it's hard to change Cython code because it might break
third-party libraries and projects relying on pyarrow through Cython.

Mostly the problem comes from the fact that the documentation suggests
pyarrow.lib.* ( https://arrow.apache.org/docs/python/extending.html#example
) as the way to import features from pyarrow in Cython.
Given that most of pyarrow is implemented by including pxi files into the
lib.pyx module (
https://github.com/apache/arrow/blob/master/python/pyarrow/lib.pyx#L118-L163
), this means we are exposing the majority of the internals as our public
API.

The consequence is that, in practice, we are preventing ourselves from
touching anything that exists in those included files, as it might already
be used by another project and thus cannot be moved or have its signature
changed.

We could argue that only what was documented explicitly should be
considered "public" and everything else can be changed, but our
documentation seems to be unclear on this point. It lists some functions
that should be considered our explicit API (
https://arrow.apache.org/docs/python/extending.html#cython-api ) but then
uses CArray in the example (
https://arrow.apache.org/docs/python/extending.html#example ) which wasn't
listed as public.

I think it would be helpful to come to an agreement about what we should
consider publicly exposed from Cython so that we can properly update
documentation and unblock possible refactoring.

Personally, even at the risk of breaking third-party code, I think it would
be wise to aim for the minimum exposed surface. I'd consider Cython mostly
an implementation detail and promote usage of libarrow from C/C++ directly
if you need to build high-performance Python extensions.


Re: [DISCUSS][Python] Making NumPy optional dependency?

2021-08-17 Thread Alessandro Molina
I did a quick-and-dirty experiment, and what I got was a segmentation fault,
but I guess a lot depends on which of the things you are using were
inlined at compile time.
I could get past the segfault by using PyImport_ImportModule to check whether
numpy exists, and got a working minimal case where I could import pyarrow
and create an array out of a Python list. It's just a hack, far from being a
decent solution or test, but I think that the most intertwined places are
already behind a PyArray_Check, and thus we could use that as a guard to
avoid executing numpy-related code. It looks like the majority of the
work would actually be in Cython, where, by the way, dealing with
unavailable imports is much more straightforward.

There are some interesting open points, though, like the fact that the mask
for a pyarrow array can only be a numpy array: how could I create a masked
array without numpy? I guess that accepting Arrow arrays as masks is
actually something we should allow anyway.
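
The same guard idea, sketched at the Python level (the helper names are
mine, not pyarrow's): detect numpy without importing it, and fall back to a
plain-Python path when it's absent, mirroring what a PyArray_Check guard
would do on the C++ side.

```python
# Sketch: treat numpy as a soft dependency. find_spec() detects the
# package without importing it; the numpy branch only executes when
# numpy is actually installed. Helper names are illustrative.
import importlib.util

def have_numpy() -> bool:
    # True if numpy is installed, without paying the import cost.
    return importlib.util.find_spec("numpy") is not None

def to_plain_list(obj):
    # Accept numpy arrays when numpy exists; otherwise only plain
    # Python sequences can ever reach this code path.
    if have_numpy():
        import numpy as np
        if isinstance(obj, np.ndarray):
            return obj.tolist()
    return list(obj)

print(to_plain_list([1, 2, 3]))  # [1, 2, 3] with or without numpy
```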

On Mon, Aug 16, 2021 at 6:53 PM Antoine Pitrou  wrote:

>
> I agree that "what happens when Numpy is not available at runtime" is a
> rather annoying problem.  I'm not sure what happens when you call one
> of the Numpy C API functions and Numpy is not found (crash? error
> return?).  It can probably be detected, but needs to be done
> consistently at the start of each PyArrow core function, which requires
> some care.
>
> At the end of the day, it looks like this would be a significant amount
> of work for a relatively minor benefit (did people complain about
> this?), so I'm not sure it's worth spending some time on it.
>
> Regards
>
> Antoine.
>
>
>
> On Mon, 16 Aug 2021 18:09:54 +0200
> Wes McKinney  wrote:
> > I've thought about this in the past, and I would like to make NumPy an
> > optional dependency, but one of the things that kept me from trying
> > was the extent to which NumPy arrays are supported as inputs (or
> > elements of inputs) to pyarrow.array. The implementation in
> > python_to_arrow.cc is significantly intertwined with NumPy's C API. It
> > might require maintaining two altogether different internal
> > implementations of pyarrow.array, a complicated one which deals with
> > all the NumPy oddities (including NumPy array scalars) and a much
> > simpler one that does not. pyarrow may have to detect at runtime
> > whether numpy is in sys.modules to decide whether to import and invoke
> > the more complicated function.
> >
> > On Mon, Aug 16, 2021 at 5:59 PM Alessandro Molina
> >  wrote:
> > >
> > > As Arrow/PyArrow grows more compute functions and features we might
> move
> > > toward a world where the number of users relying on PyArrow without
> going
> > > through Pandas or NumPy might grow.
> > >
> > > NumPy is a compile time dependency for PyArrow as it's required to
> compile
> > > the C++ code needed to implement the pandas/numpy integration, but
> there
> > > has been some discussion regard the fact that we could make NumPy
> optional
> > > at runtime (remove it from required dependencies in the Python
> > > distribution). You would have to install numpy only if you need to
> invoke
> > > to_numpy or to_pandas methods or similar integration features. For all
> the
> > > other use cases, that rely on Arrow alone, you would be able to pip
> install
> > > pyarrow without involving any other dependency and be ready to go.
> > >
> > > Technically it seems a bit complicated, Python/Cython can always work
> > > around missing libraries, but we would have to find ways to deal with
> lazy
> > > involvement of numpy from C++. I don't know if this is something that
> was
> > > already discussed in the past and thus someone already has solutions
> for
> > > this part of the problem, but before investing time and effort in
> research
> > > I think it made sense to make sure it's a goal that the development
> team
> > > agrees with.
> >
>
>
>
>


[DISCUSS][Python] Making NumPy optional dependency?

2021-08-16 Thread Alessandro Molina
As Arrow/PyArrow grows more compute functions and features, we might move
toward a world where the number of users relying on PyArrow without going
through Pandas or NumPy grows.

NumPy is a compile-time dependency for PyArrow, as it's required to compile
the C++ code that implements the pandas/numpy integration, but there
has been some discussion regarding the fact that we could make NumPy optional
at runtime (remove it from the required dependencies in the Python
distribution). You would have to install numpy only if you need to invoke
to_numpy, to_pandas, or similar integration features. For all the
other use cases, which rely on Arrow alone, you would be able to pip install
pyarrow without pulling in any other dependency and be ready to go.

Technically it seems a bit complicated. Python/Cython can always work
around missing libraries, but we would have to find ways to deal with the
lazy involvement of numpy from C++. I don't know if this was
already discussed in the past and whether someone already has solutions for
this part of the problem, but before investing time and effort in research
I think it makes sense to make sure it's a goal the development team
agrees with.


[DISCUSS][Python] Moving Python specific code into pyarrow

2021-08-16 Thread Alessandro Molina
PyArrow is currently a full Cython codebase, but in reality it relies on some
classes and functions implemented in C++ within the src/python
directory ( https://github.com/apache/arrow/tree/master/cpp/src/arrow/python
), especially the numpy/pandas conversion code that has to interface with
NumPy array data at a low level.

When working in the PyArrow area, it's not uncommon to end up
jumping back and forth between the Arrow C++ codebase for Python and
PyArrow, and you can also end up with, sometimes hard-to-catch, integration
issues if you forget to recompile libarrow, even when you are working on a
Python-only change.

I'm wondering if it wouldn't make life easier for contributors if the
src/arrow/python directory were moved into pyarrow and we made PyArrow able
to build it.

That would probably reduce the risk of integration issues, as rebuilding
pyarrow alone would probably be enough for most Python-specific changes
(since it would also rebuild the Python-specific C++).

I think that moving src/arrow/python into pyarrow would also make the
codebase more cohesive, which would lower the barrier for new contributors
looking for how to fix a pyarrow-specific issue.

Unless there is a major side effect I'm missing (beyond slightly more
complex build scripts for pyarrow, but it's already CMake-based, so
building some C++ shouldn't be a big deal), it seems that
the benefits of having all Python-related code in a single place would
outweigh the downsides.

Also, I'm not sure how widespread the requirement of Python from C++ is,
but it seems to me that if we moved all Python-specific code into pyarrow,
we could decouple libarrow from Python. That might make it easier to
deal with virtualenvs or debug versions of Python, as you wouldn't have to
deal with Python3_EXECUTABLE etc. when building libarrow.

Any thoughts?


Re: Apache Arrow Cookbook

2021-07-28 Thread Alessandro Molina
Hi everybody,

The Cookbook PR has been open for more than a week at this point, and we
have received tons of great feedback and suggestions, many of which we have
already incorporated.
For the benefit of being able to verify the publishing workflow and the CI,
I'd love to ask if anyone could merge the PR (unless there are
major blockers), as it's an Apache repository and thus requires explicit
permissions.
That way we can start verifying that the build process we put in place leads
to the expected results, and maybe add a link to the Cookbook from the Arrow
documentation before the new documentation gets deployed for 5.0.0.

On Tue, Jul 20, 2021 at 12:24 PM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> The Pull Request for the Cookbook has been created (
> https://github.com/apache/arrow-cookbook/pull/1 )
> I left as comments in the PR the steps that need to be done to enable
> compilation of the cookbook once the PR is merged (enabling actions, gh
> pages etc...) anyone willing to merge it should probably also take care of
> those few steps so that we can make sure that all pieces are in place.
> Thanks!
>
> On Wed, Jul 14, 2021 at 11:43 PM Wes McKinney  wrote:
>
>> I just initialized
>>
>> https://github.com/apache/arrow-cookbook
>>
>> On Wed, Jul 14, 2021 at 1:33 PM Wes McKinney  wrote:
>> >
>> > On Wed, Jul 14, 2021 at 8:33 AM Alessandro Molina
>> >  wrote:
>> > >
>> > > On Tue, Jul 13, 2021 at 2:40 PM Wes McKinney 
>> wrote:
>> > >
>> > > > I requested its creation here
>> > > >
>> > > > https://github.com/apache/arrow-cookbook
>> > > >
>> > > > If you can set up a PR into this repo (not sure if I need to push an
>> > > > empty "initial commit" repo, but let me know),
>> > >
>> > >
>> > > Seems your concern was correct, you can't open PRs against an empty
>> > > repository.
>> > > If you could make an initial commit that would be great.
>> >
>> > OK, will do.
>> >
>> > >
>> > > > please make sure
>> > > > everyone who has contributed has an ICLA on file with the ASF
>> > > > secretary. I'm not sure that it's necessary for us to conduct an IP
>> > > > clearance but others can comment if they disagree.
>> > > >
>> > >
>> > > I guess that http://people.apache.org/phonebook.html only covers a
>> list of
>> > > committers and PMC members, not general contributors.
>> > > Is there any way to check if any contributor has already signed an
>> ICLA?
>> > > Also, for my general understanding, should we ask to sign the ICLA
>> before
>> > > accepting/merging PRs or is it acceptable to merge PRs from occasional
>> > > contributions even in absence of a signed ICLA?
>> >
>> > Since https://github.com/ursacomputing/arrow-cookbook is "outside the
>> > Arrow community", having the contributors to this repository sign
>> > ICLAs would be a good practice before moving the code to an Apache
>> > repository. Since this codebase isn't very old and we probably won't
>> > be making official ASF releases of this project, the formal IP
>> > clearance process is likely not necessary.
>> >
>> > We don't need ICLAs from normal contributors into Apache repositories.
>>
>


Re: Apache Arrow Cookbook

2021-07-20 Thread Alessandro Molina
The Pull Request for the Cookbook has been created (
https://github.com/apache/arrow-cookbook/pull/1 ).
I left comments in the PR describing the steps needed to enable
compilation of the cookbook once the PR is merged (enabling Actions, GitHub
Pages, etc.); anyone willing to merge it should probably also take care of
those few steps so that we can make sure that all the pieces are in place.
Thanks!

On Wed, Jul 14, 2021 at 11:43 PM Wes McKinney  wrote:

> I just initialized
>
> https://github.com/apache/arrow-cookbook
>
> On Wed, Jul 14, 2021 at 1:33 PM Wes McKinney  wrote:
> >
> > On Wed, Jul 14, 2021 at 8:33 AM Alessandro Molina
> >  wrote:
> > >
> > > On Tue, Jul 13, 2021 at 2:40 PM Wes McKinney 
> wrote:
> > >
> > > > I requested its creation here
> > > >
> > > > https://github.com/apache/arrow-cookbook
> > > >
> > > > If you can set up a PR into this repo (not sure if I need to push an
> > > > empty "initial commit" repo, but let me know),
> > >
> > >
> > > Seems your concern was correct, you can't open PRs against an empty
> > > repository.
> > > If you could make an initial commit that would be great.
> >
> > OK, will do.
> >
> > >
> > > > please make sure
> > > > everyone who has contributed has an ICLA on file with the ASF
> > > > secretary. I'm not sure that it's necessary for us to conduct an IP
> > > > clearance but others can comment if they disagree.
> > > >
> > >
> > > I guess that http://people.apache.org/phonebook.html only covers a
> list of
> > > committers and PMC members, not general contributors.
> > > Is there any way to check if any contributor has already signed an
> ICLA?
> > > Also, for my general understanding, should we ask to sign the ICLA
> before
> > > accepting/merging PRs or is it acceptable to merge PRs from occasional
> > > contributions even in absence of a signed ICLA?
> >
> > Since https://github.com/ursacomputing/arrow-cookbook is "outside the
> > Arrow community", having the contributors to this repository sign
> > ICLAs would be a good practice before moving the code to an Apache
> > repository. Since this codebase isn't very old and we probably won't
> > be making official ASF releases of this project, the formal IP
> > clearance process is likely not necessary.
> >
> > We don't need ICLAs from normal contributors into Apache repositories.
>


Re: Apache Arrow Cookbook

2021-07-14 Thread Alessandro Molina
On Tue, Jul 13, 2021 at 2:40 PM Wes McKinney  wrote:

> I requested its creation here
>
> https://github.com/apache/arrow-cookbook
>
> If you can set up a PR into this repo (not sure if I need to push an
> empty "initial commit" repo, but let me know),


Seems your concern was correct: you can't open PRs against an empty
repository.
If you could make an initial commit, that would be great.


> please make sure
> everyone who has contributed has an ICLA on file with the ASF
> secretary. I'm not sure that it's necessary for us to conduct an IP
> clearance but others can comment if they disagree.
>

I guess that http://people.apache.org/phonebook.html only covers a list of
committers and PMC members, not general contributors.
Is there any way to check whether a contributor has already signed an ICLA?
Also, for my general understanding, should we ask contributors to sign the
ICLA before accepting/merging PRs, or is it acceptable to merge occasional
contributions even in the absence of a signed ICLA?


Re: [DISCUSS] Should we start marking "feather" as deprecated?

2021-07-14 Thread Alessandro Molina
I think, from a user's point of view, it would be helpful to have only one
clearly documented glossary and way to do things.
At the moment, at least in the Python documentation, it's not very clear
what the difference is between feather and ipc.new_file.
Deprecating the Feather terminology would surely solve this problem, but
even if we don't end up deprecating it, I think we should make clearer
what users are expected to rely on; otherwise there is the risk of
building a product that competes with itself and ends up creating confusion
for users.


Re: [DISCUSS] What is the Plasma status currently?

2021-07-14 Thread Alessandro Molina
I was wondering, for the benefit of lowering the entry barrier for users,
and especially for future contributors who might find themselves confused by
the number of optional pieces you can pick when building Arrow: would
it be reasonable to ship Plasma as a separate library? Something like
arrow-plasma, with its own packaging/release cycle? That would also have the
benefit of giving us a better understanding of how many people actually
depend on it, based on how many people depend on that package.

It's true that there would be an initial burden in separating the codebase
and building its own CI/release scripts, but I think it would ease life for
people willing to contribute to Arrow while "ignoring" Plasma, and it would
give Plasma the chance to get maintenance outside of the Arrow developers,
from people who might not care about contributing to Arrow itself at the
moment.

On Tue, Jul 13, 2021 at 2:18 PM Neal Richardson 
wrote:

> Hi Jarek,
> Your understanding sounds about right to me. That said, we are still
> building and shipping Plasma for those that have come to depend on it and
> will continue to do so unless/until it becomes a maintenance burden. But no
> one active in the Arrow community is working on Plasma.
>
> Neal
>
> On Tue, Jul 13, 2021 at 3:07 AM Jarek Potiuk  wrote:
>
> > Hello Arrow Community,
> >
> > We've had a very interesting talk at the Apache Airflow Summit about
> > Airflow + Ray (which is really cool BTW and I am looking forward to
> > capabilities it will give to Airflow) and we had some discussions that
> > followed. From what I understand (maybe I am wrong?) the Plasma which was
> > initially developed in Ray, then contributed to Arrow, and then (
> >
> https://lists.apache.org/thread.html/r65b2852e4cddb1af8bff06d789bf3822d6c5dfcd481414acd3d7%40%3Cdev.arrow.apache.org%3E
> )
> > forked (?) by Ray and is kind-of abandoned in Arrow and not really
> > maintained in Arrow any more (and likely Ray version and Arrow version
> are
> > not compatible /exchangeable).
> >
> > Is this correct understanding ? Any more comments or maybe explanation
> > what is the relation between Arrow's Plasma and Ray's Plasma?
> >
> > Just to explain my interest -  I am a PMC of Apache Airflow, I am an
> > independent Open-Source contributor and advisor, and I am genuinely
> > interested in Open-source business models and rationale of stakeholders
> and
> > how this plays out with individuals and the ASF/PMC and I wanted to
> > understand the current state of Plasma :)
> >
> > J.
> >
> >
>


Re: Apache Arrow Cookbook

2021-07-13 Thread Alessandro Molina
How should we move forward to "request" an arrow-cookbook repository under
the apache organization? Is there a form or request that has to be
submitted?
Another thing we were wondering: being able to handle contributions through
GitHub Issues would lower the barrier for users who find issues and want to
report them. Once the repository is moved under the Apache organization,
will we be able to keep using GitHub Issues like the Rust projects are
doing, or should we enforce usage of JIRA for reporting issues?

On Fri, Jul 9, 2021 at 5:59 PM Wes McKinney  wrote:

> Some benefits of separating the cookbook from the documentation would
> be to decouple its release / publication from Arrow releases, so you
> can roll out new content to the published version as soon as it's
> merged into the repository, where in the same fashion we might not
> want to publish inter-release changes to the documentation. You could
> also have a separate entry point to increase navigability (since the
> documentation is intended to be more of a reference book).
>
> Given that the Rust projects have decoupled into multiple
> repositories, a "cookbook" repository could also be a place to collect
> recipes related to DataFusion.
>
> Either option is plenty reasonable, though, so feel free to choose
> what makes the most sense to you.
>
> On Thu, Jul 8, 2021 at 12:09 PM Alessandro Molina
>  wrote:
> >
> > Thinking about it, I think that having the cookbook into its own
> repository
> > (apache/arrow-cookbook) might lower the barrier for contributors. You
> only
> > need to clone the cookbook and running `make` does also take care of
> > installing the required dependencies, so in theory you don't even need to
> > care too much about setting up your environment. But we can surely
> improve
> > the README in the repo further to ease contributions.
> >
> > I think we can also preserve the benefit that Nic mentioned of making
> sure
> > that on each Arrow build the recipes are verified by triggering a build
> of
> > the cookbook repository on each new arrow master change. Worst case,
> have a
> > nightly build for the cookbook that clones that latest arrow master
> branch.
> >
> > Having a cookbook for C++ is a very good idea, that might be the next
> step
> > once we finish the Python and R versions. If people want to contribute
> > cookbook versions for more languages that would be greatly appreciated
> too.
> >
> > On the other hand, while we want to keep the cookbooks in the same
> > repository and sharing the same infrastructure to keep a low entry
> barrier
> > (make py/r/X will just compile the cookbook for the language you
> picked), I
> > feel that keeping the cookbook separated per language is a good idea.
> While
> > it's cool to be able to compare the solution between languages, in
> general
> > developers look for the solution in their target language and might
> > perceive as noise the other implementations.
> > For example, we received similar feedback for the Arrow documentation
> too,
> > that as a Python developer it's hard to find what you are looking for
> > because it's mixed with the "format" and "C++" documentation and there
> are
> > a few links back and forth between them.
> >
> >
> >
> >
> >
> > On Thu, Jul 8, 2021 at 11:39 AM Nic  wrote:
> >
> > > One of the possible aims for the cookbook is having interlinked
> > > documentation between function docs and the cookbook, and both the R
> and
> > > Python docs include tests that all of the outputs are expected.
> Including
> > > these tests means that we can immediately see if any code changes
> render
> > > any recipes incorrect.  Therefore the decoupling between cookbook
> updates
> > > and docs updates may not be necessary.
> > >
> > > That said, there has been mention of having versions of the cookbook
> tied
> > > to released versions of Arrow, which sounds like a great idea.
> > >
> > > The repo also includes a Makefile which creates all the relevant
> setup, so
> > > hopefully that should simplify things for users.  The R cookbook uses
> > > bookdown, which has a feature where a reader can click an 'edit'
> button and
> > > it automatically creates a fork where they can edit the cookbook and
> submit
> > > a PR directly from GitHub.
> > >
> > > It'd be great to see a lot of recipes in multiple languages, but in the
> > > document of possible recipes circulated previously, we identified
> slightly
> > > different needs for recipes for R/Python, and this may be further
> > > complicated by writing for slightly different audiences (from what I
> > > understand, the pyarrow implementation may be more geared towards people
> > > building on top of the low-level bindings, whereas in R, we have both that
> > > audience as well as folks who just want to make their dplyr code run faster
> > > without needing to know that much about the details of Arrow).

5.0.0 Release and Release Manager

2021-07-08 Thread Alessandro Molina
As mentioned in the biweekly sync call, we are approaching the target date
for the 5.0.0 release, which should happen at the end of next week or,
worst case, the week after.

Apart from my usual recommendation to take a look at the TODO backlog at
https://cwiki.apache.org/confluence/display/ARROW/Arrow+5.0.0+Release and
defer to 6.0.0 any ticket you don't think you will be able to tackle in the
next week or two, I think we also need to start raising the topic of who is
willing to be the release manager for this round.

At the biweekly call Jorge already mentioned that he might not be able to
take care of this release as he did for 4.0.1, so if anyone else is willing
to take over the process, it would be great to start assessing who might be
able to do so.

Thanks!
Alessandro


Re: Apache Arrow Cookbook

2021-07-08 Thread Alessandro Molina
Thinking about it, I think that having the cookbook in its own repository
(apache/arrow-cookbook) might lower the barrier for contributors. You only
need to clone the cookbook, and running `make` also takes care of
installing the required dependencies, so in theory you don't even need to
care too much about setting up your environment. But we can surely improve
the README in the repo further to ease contributions.

I think we can also preserve the benefit that Nic mentioned of making sure
that the recipes are verified on each Arrow build, by triggering a build of
the cookbook repository on each new change to arrow master. Worst case, we
can have a nightly build for the cookbook that clones the latest arrow
master branch.

Having a cookbook for C++ is a very good idea, that might be the next step
once we finish the Python and R versions. If people want to contribute
cookbook versions for more languages that would be greatly appreciated too.

On the other hand, while we want to keep the cookbooks in the same
repository, sharing the same infrastructure to keep a low entry barrier
(`make py/r/X` will just compile the cookbook for the language you picked),
I feel that keeping the cookbooks separated per language is a good idea.
While it's cool to be able to compare solutions between languages, in
general developers look for the solution in their target language and might
perceive the other implementations as noise.
For example, we received similar feedback for the Arrow documentation too:
as a Python developer it's hard to find what you are looking for because
it's mixed with the "format" and "C++" documentation, and there are a few
links back and forth between them.





On Thu, Jul 8, 2021 at 11:39 AM Nic  wrote:

> One of the possible aims for the cookbook is having interlinked
> documentation between function docs and the cookbook, and both the R and
> Python docs include tests that all of the outputs are expected.  Including
> these tests means that we can immediately see if any code changes render
> any recipes incorrect.  Therefore the decoupling between cookbook updates
> and docs updates may not be necessary.
>
> That said, there has been mention of having versions of the cookbook tied
> to released versions of Arrow, which sounds like a great idea.
>
> The repo also includes a Makefile which creates all the relevant setup, so
> hopefully that should simplify things for users.  The R cookbook uses
> bookdown, which has a feature where a reader can click an 'edit' button and
> it automatically creates a fork where they can edit the cookbook and submit
> a PR directly from GitHub.
>
> It'd be great to see a lot of recipes in multiple languages, but in the
> document of possible recipes circulated previously, we identified slightly
> different needs for recipes for R/Python, and this may be further
> complicated by writing for slightly different audiences (from what I
> understand, the pyarrow implementation may be more geared towards people
> building on top of the low-level bindings, whereas in R, we have both that
> audience as well as folks who just want to make their dplyr code run faster
> without needing to know that much about the details of Arrow).
>
> I wonder, though, if we could still achieve that by having an additional
> page that points to the recipes that *are* common between each cookbook.
>
> On Thu, 8 Jul 2021 at 10:07, Antoine Pitrou  wrote:
>
> >
> > Hi Rares,
> >
> > Documentation bugs and improvement requests are welcome, feel free to
> > file them on the JIRA!
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 08/07/2021 à 01:45, Rares Vernica a écrit :
> > > Awesome! We would find C++ versions of these recipes very useful. From
> > our
> > > experience the C++ API is much much harder to deal with and error prone
> > > than the R/Python one.
> > >
> > > Cheers,
> > > Rares
> > >
> > > On Wed, Jul 7, 2021 at 9:07 AM Alessandro Molina <
> > > alessan...@ursacomputing.com> wrote:
> > >
> > >> Yes, that was mostly what I meant when I wrote that the next step is
> > >> opening a PR against the apache/arrow repository itself :D
> > >> We moved forward in a separate repository initially to be able to
> cycle
> > >> more quickly, but we reached a point where we think we can start
> > >> integrating the cookbook with the Arrow documentation itself.
> > >>
> > >> If instead it's preferred to move forward the effort into its own
> > separated
> > >> repository (apache/arrow-cookbook) that's an option too, we are open
> to
> > >> suggestions from the community.
> > >>

Re: Apache Arrow Cookbook

2021-07-07 Thread Alessandro Molina
Yes, that was mostly what I meant when I wrote that the next step is
opening a PR against the apache/arrow repository itself :D
We moved forward in a separate repository initially to be able to cycle
more quickly, but we reached a point where we think we can start
integrating the cookbook with the Arrow documentation itself.

If instead it's preferred to move the effort into its own separate
repository (apache/arrow-cookbook), that's an option too; we are open to
suggestions from the community.

On Wed, Jul 7, 2021 at 5:57 PM Wes McKinney  wrote:

> What do you think about developing this cookbook in an Apache Arrow
> repository (it could be something like apache/arrow-cookbook, if not
> part of the main development repo)? Creating expanded documentation
> resources for learning how to use Apache Arrow to solve problems seems
> certainly within the bounds of the community's objectives.
>
> On Wed, Jul 7, 2021 at 5:52 PM Alessandro Molina
>  wrote:
> >
> > We finally have a first preview of the cookbook available for R and
> Python,
> > for anyone interested the two versions are visible at
> > http://ursacomputing.com/arrow-cookbook/py/index.html and
> > http://ursacomputing.com/arrow-cookbook/r/index.html
> > A new version of the cookbook is automatically published on each new
> recipe.
> >
> > After gathering feedback from interested parties and users, our plan for
> > the next step would be to open a PR against the arrow repository and
> > automate publishing the cookbook via github actions.
> >
> > At the moment the recipes implemented are nearly half of those that were
> > identified in the dedicated Google Docs (
> >
> https://docs.google.com/document/d/1v-jK_9osnLvAnAjLOM_frgzakjFhLpUi8OC0MlKpxzw/edit?ts=60c73189#heading=h.m7fas2talgy5
> > ) so if you have recipes to suggest feel free to leave comments on that
> > document or suggest edits.
> >
> >
> > On Mon, Jun 21, 2021 at 10:34 AM Alessandro Molina <
> > alessan...@ursacomputing.com> wrote:
> >
> > > Hi,
> > >
> > > I'd like to share with the ML an idea which me and Nic Crane have been
> > > experimenting with. It's still in the early stage, but we hope to turn
> it
> > > into a PR for Arrow documentation soon.
> > >
> > > The idea is to work on a Cookbook, a collection of ready made recipes,
> on
> > > how to use Arrow that both end users and developers of third party
> > > libraries can refer to when they need to look up "the arrow way" of
> doing
> > > something.
> > >
> > > While the arrow documentation reports all features and functions that
> are
> > > available in arrow, it's not always obvious how to best combine them
> for a
> > > new user. Sometimes the solution ends up being more complicated than
> > > necessary or performs badly due to not obvious side effects like
> unexpected
> > > memory copies etc.
> > >
> > > For this reason we thought about starting a documentation that users
> can
> > > refer to on how to combine arrow features to achieve the results they
> care
> > > about.
> > >
> > > We wrote a short document explaining the idea at
> > >
> https://docs.google.com/document/d/1v-jK_9osnLvAnAjLOM_frgzakjFhLpUi8OC0MlKpxzw/edit?usp=sharing
> > >
> > > The core idea behind the cookbook is that all recipes should be
> testable,
> > > so it should be possible to add a CI phase for the cookbook that
> verifies
> > > that all the recipes still work with the current version of Arrow and
> lead
> > > to the expected results.
> > >
> > > At the moment we started it in a separate repository (
> > > https://github.com/ursacomputing/arrow-cookbook ), but we are yet
> unsure
> > > if it should live inside arrow/docs or its own directory (IE:
> > > arrow/cookbook) or its own repository. In the end it's fairly decoupled
> > > from the rest of Arrow and the documentation, which would have the
> benefit
> > > of allowing a dedicated release cycle every time new recipes are added
> (at
> > > least in the early phase).
> > >
> > > We are also looking for more ideas about recipes that would be good
> > > candidates for inclusion, so if any of you has thoughts about which
> recipes
> > > we should add please feel free to comment on the document or reply by
> mail
> > > suggesting more recipes.
> > >
> > > Any suggestion for improvements is appreciated! We hope to have
> something
> > > we can release with the next Arrow release.
> > >
>


Re: Apache Arrow Cookbook

2021-07-07 Thread Alessandro Molina
We finally have a first preview of the cookbook available for R and Python;
for anyone interested, the two versions are visible at
http://ursacomputing.com/arrow-cookbook/py/index.html and
http://ursacomputing.com/arrow-cookbook/r/index.html
A new version of the cookbook is automatically published for each new recipe.

After gathering feedback from interested parties and users, our plan for
the next step would be to open a PR against the arrow repository and
automate publishing the cookbook via github actions.

At the moment the recipes implemented are nearly half of those that were
identified in the dedicated Google Doc (
https://docs.google.com/document/d/1v-jK_9osnLvAnAjLOM_frgzakjFhLpUi8OC0MlKpxzw/edit?ts=60c73189#heading=h.m7fas2talgy5
), so if you have recipes to suggest, feel free to leave comments on that
document or suggest edits.


On Mon, Jun 21, 2021 at 10:34 AM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> Hi,
>
> I'd like to share with the ML an idea which me and Nic Crane have been
> experimenting with. It's still in the early stage, but we hope to turn it
> into a PR for Arrow documentation soon.
>
> The idea is to work on a Cookbook, a collection of ready made recipes, on
> how to use Arrow that both end users and developers of third party
> libraries can refer to when they need to look up "the arrow way" of doing
> something.
>
> While the arrow documentation reports all features and functions that are
> available in arrow, it's not always obvious how to best combine them for a
> new user. Sometimes the solution ends up being more complicated than
> necessary or performs badly due to not obvious side effects like unexpected
> memory copies etc.
>
> For this reason we thought about starting a documentation that users can
> refer to on how to combine arrow features to achieve the results they care
> about.
>
> We wrote a short document explaining the idea at
> https://docs.google.com/document/d/1v-jK_9osnLvAnAjLOM_frgzakjFhLpUi8OC0MlKpxzw/edit?usp=sharing
>
> The core idea behind the cookbook is that all recipes should be testable,
> so it should be possible to add a CI phase for the cookbook that verifies
> that all the recipes still work with the current version of Arrow and lead
> to the expected results.
>
> At the moment we started it in a separate repository (
> https://github.com/ursacomputing/arrow-cookbook ), but we are yet unsure
> if it should live inside arrow/docs or its own directory (IE:
> arrow/cookbook) or its own repository. In the end it's fairly decoupled
> from the rest of Arrow and the documentation, which would have the benefit
> of allowing a dedicated release cycle every time new recipes are added (at
> least in the early phase).
>
> We are also looking for more ideas about recipes that would be good
> candidates for inclusion, so if any of you has thoughts about which recipes
> we should add please feel free to comment on the document or reply by mail
> suggesting more recipes.
>
> Any suggestion for improvements is appreciated! We hope to have something
> we can release with the next Arrow release.
>


Re: [python] [iter_batches] Is there any value to an iterator based parquet reader in python?

2021-07-06 Thread Alessandro Molina
I guess that doing it at the Parquet reader level might allow the
implementation to better leverage row groups, without needing to keep the
whole Table in memory while you are iterating over the data. The current
Jira issue seems to suggest implementing it for a Table once it's already
fully available.

On Tue, Jul 6, 2021 at 8:48 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> There is a recent JIRA where a row-wise iterator was discussed:
> https://issues.apache.org/jira/browse/ARROW-12970.
>
> This should not be too hard to add (although there is a linked JIRA about
> improving the performance of the pyarrow -> python objects conversion,
> which might require some more engineering work to do), but of course what's
> proposed in the JIRA is starting from a materialized record batch (so
> similarly as the gist here, but I think this is good enough?).
>
> On Tue, 6 Jul 2021 at 05:03, Micah Kornfield 
> wrote:
>
>> I think this type of thing does make sense, at some point people like to
>> be be able see their data in rows.
>>
>> It probably pays to have this conversation on dev@.  Doing this in a
>> performant way might take some engineering work, but having a quick
>> solution like the one described above might make sense.
>>
>> -Micah
>>
>> On Sun, Jun 27, 2021 at 6:23 AM Grant Williams 
>> wrote:
>>
>>> Hello,
>>>
>>> I've found myself wondering if there is a use case for using the
>>> iter_batches method in python as an iterator in a similar style to a
>>> server-side cursor in Postgres. Right now you can use an iterator of record
>>> batches, but I wondered if having some sort of python native iterator might
>>> be worth it? Maybe a .to_pyiter() method that converts it to a lazy &
>>> batched iterator of native python objects?
>>>
>>> Here is some example code that shows a similar result.
>>>
>>> from itertools import chain
>>> from typing import Any, Iterator, Tuple
>>>
>>> def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
>>>     record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)
>>>
>>>     # convert from the columnar format of pyarrow arrays to a row format
>>>     # of python objects (yields tuples)
>>>     yield from chain.from_iterable(
>>>         zip(*(col.to_pylist() for col in batch.columns))
>>>         for batch in record_batches
>>>     )
>>>
>>> (or a gist if you prefer:
>>> https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
>>>
>>> I realize arrow is a columnar format, but I wonder if having the
>>> buffered row reading as a lazy iterator is a common enough use case with
>>> parquet + object storage being so common as a database alternative.
>>>
>>> Thanks,
>>> Grant
>>>
>>> --
>>> Grant Williams
>>> Machine Learning Engineer
>>> https://github.com/grantmwilliams/
>>>
>>


Re: Moving "Improvements" and "New Features" to 6.0.0 release

2021-07-05 Thread Alessandro Molina
I should have left all "In Progress" tasks assigned to 5.0.0; if you find
any task you want to release in 5.0.0 (which I guess means it should be
merged in the next two weeks), feel free to reassign it to 5.0.0.

That moved the list of TODO tickets for 5.0.0 down to 57 issues (
https://cwiki.apache.org/confluence/display/ARROW/Arrow+5.0.0+Release )

On Sat, Jul 3, 2021 at 3:59 AM Weston Pace  wrote:

> Can you leave the ones marked “in progress” or that have the
> pull-request-available label?
>
> On Thu, Jul 1, 2021 at 11:06 PM Alessandro Molina <
> alessan...@ursacomputing.com> wrote:
>
> > Hi everybody,
> >
> > Given that the expected time for release 5.0.0 is approaching and there
> are
> > 160+ Jira issues assigned to that release (
> > https://cwiki.apache.org/confluence/display/ARROW/Arrow+5.0.0+Release )
> > I'd
> > like to propose to do some cleanup of the TODO by bulk moving all 5.0.0
> > jira issues  flagged as "Improvement" or "New Feature" to 6.0.0. That
> will
> > reduce the scope of Jira issues for 5.0.0 to ~30 bugs which seems a
> > more manageable goal.
> >
> > If anyone has a new feature or improvement they are working on and want
> to
> > include in 5.0.0, those ones will obviously still be able to go. What I'm
> > proposing is just to move the issues by default and then in case anyone
> > wants to keep some of them in 5.0.0 they can be moved back.
> >
> > In general if we move an issue you are involved on, you should receive a
> > notification and should be able to move it back to 5.0.0 if you want to
> > ship it in that release.
> >
> > I hope this helps getting a better understanding of what's going to go in
> > 5.0.0 so that release announcements can be more easily prepared
> >
> > Any thoughts?
> >
>


Moving "Improvements" and "New Features" to 6.0.0 release

2021-07-02 Thread Alessandro Molina
Hi everybody,

Given that the expected time for the 5.0.0 release is approaching and there
are 160+ Jira issues assigned to that release (
https://cwiki.apache.org/confluence/display/ARROW/Arrow+5.0.0+Release ), I'd
like to propose some cleanup of the TODO by bulk-moving all 5.0.0 Jira
issues flagged as "Improvement" or "New Feature" to 6.0.0. That would
reduce the scope of Jira issues for 5.0.0 to ~30 bugs, which seems a more
manageable goal.

If anyone has a new feature or improvement they are working on and want to
include in 5.0.0, it will obviously still be able to go in. What I'm
proposing is just to move the issues by default; anyone who wants to keep
some of them in 5.0.0 can then move them back.

In general, if we move an issue you are involved in, you should receive a
notification and be able to move it back to 5.0.0 if you want to ship it in
that release.

I hope this helps give a better understanding of what's going to go into
5.0.0, so that release announcements can be more easily prepared.

Any thoughts?


Re: [Format] Bounded numbers?

2021-06-22 Thread Alessandro Molina
On Tue, Jun 22, 2021 at 12:27 PM Antoine Pitrou  wrote:

> On Mon, 21 Jun 2021 23:50:29 -0400
> Ying Zhou  wrote:
> > Hi,
> >
> > In data people use there are often bounded numbers, mostly integers with
> clear and fixed upper and lower bounds but also decimals and floats as well
> e.g. test scores, numerous codes in older databases, max temperature of a
> city, latitudes, longitudes, numerous IDs etc. I wonder whether we should
> include such types in Arrow (and more importantly in Parquet & Avro where
> size matters a lot more).
>
> You are expressing two separate concerns here:
> 1. expressing the semantics (and perhaps enforcing them, e.g. return an
>error when an addition gives a result out of bounds)
>

I wonder if DictionaryArray could be a foundation for such semantics. It
doesn't seem unreasonable to have a check that prevents you from adding
values outside those accepted by the dictionary. That seems reasonable for
most things like test scores, temperatures, etc., but probably unreasonable
for things with a larger domain of valid values, like coordinates and
floats in general.


Apache Arrow Cookbook

2021-06-21 Thread Alessandro Molina
Hi,

I'd like to share with the ML an idea that Nic Crane and I have been
experimenting with. It's still at an early stage, but we hope to turn it
into a PR for the Arrow documentation soon.

The idea is to work on a Cookbook: a collection of ready-made recipes for
using Arrow that both end users and developers of third-party libraries can
refer to when they need to look up "the arrow way" of doing something.

While the arrow documentation reports all the features and functions
available in arrow, it's not always obvious to a new user how best to
combine them. Sometimes the solution ends up being more complicated than
necessary, or performs badly due to non-obvious side effects like
unexpected memory copies.

For this reason we thought about starting documentation that users can
refer to on how to combine arrow features to achieve the results they care
about.

We wrote a short document explaining the idea at
https://docs.google.com/document/d/1v-jK_9osnLvAnAjLOM_frgzakjFhLpUi8OC0MlKpxzw/edit?usp=sharing

The core idea behind the cookbook is that all recipes should be testable,
so it should be possible to add a CI phase for the cookbook that verifies
that all the recipes still work with the current version of Arrow and lead
to the expected results.

At the moment we started it in a separate repository (
https://github.com/ursacomputing/arrow-cookbook ), but we are still unsure
whether it should live inside arrow/docs, in its own directory (i.e.
arrow/cookbook), or in its own repository. In the end it's fairly decoupled
from the rest of Arrow and the documentation, which would have the benefit
of allowing a dedicated release cycle every time new recipes are added (at
least in the early phase).

We are also looking for more ideas about recipes that would be good
candidates for inclusion, so if any of you have thoughts about which
recipes we should add, please feel free to comment on the document or reply
by mail suggesting more recipes.

Any suggestions for improvements are appreciated! We hope to have something
we can release with the next Arrow release.


Re: [Discuss] [Proposal] [C++] Arrow multithreaded stress test suite

2021-05-19 Thread Alessandro Molina
Another approach that could reduce the number of heavy tests we have to
write (if the tests are written in Python) might be to drive the code to
interleave in the ways we believe might introduce problems. This can be
done by introducing explicit breakpoints in the code and, when a breakpoint
is reached, starting the other code that we know might cause problems when
executed concurrently.

For example, imagine you want to simulate what happens when two threads
write to a file concurrently: you put a breakpoint on file.write, wait for
that breakpoint to be reached, and explicitly invoke another file.write
when that happens.

That way, instead of throwing tons of threads at the code hoping to trigger
race conditions randomly, you can reduce the amount of time/computation by
explicitly searching for problems in the areas most likely to hide them,
and raise the chances that they happen by forcing the two code blocks to
always run interleaved in the way you expect might cause problems.

In Python, using mock.patch to wrap the function where you want to stop is
usually the easy way to put those breakpoints in place. At Crunch, for
example, a Breakpoint class was written and extensively used for tests that
simulate race conditions; it was released under the MIT license (
https://github.com/Crunch-io/diagnose#breakpoints )
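A minimal, dependency-free sketch of the pattern (the names here are
illustrative, not the diagnose library's API): patch the method under test
so that the first time it is reached, the conflicting operation runs
deterministically in the middle of the call.

```python
import threading
from unittest import mock

class FakeFile:
    """Stand-in for the object whose concurrent use we want to exercise."""
    def __init__(self):
        self.writes = []

    def write(self, data):
        self.writes.append(data)

f = FakeFile()
reached = threading.Event()
original_write = FakeFile.write

def breakpoint_write(self, data):
    # The first time the breakpoint is hit, force the "concurrent" write
    # to happen right here, before the original call completes.
    if not reached.is_set():
        reached.set()
        original_write(self, "interleaved")
    return original_write(self, data)

with mock.patch.object(FakeFile, "write", breakpoint_write):
    f.write("first")
    f.write("second")

print(f.writes)  # ['interleaved', 'first', 'second']
```

The interleaving is forced rather than hoped for, so a single run exercises
the exact schedule you suspect is problematic.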

On Wed, May 19, 2021 at 9:01 AM Antoine Pitrou  wrote:

>
> Le 19/05/2021 à 07:37, Weston Pace a écrit :
> > I spoke a while ago about working on a multithreaded stress test
> > suite.  I have put together some very early details[1].  I would
> > appreciate any feedback.
>
> I would recommend writing such tests in Python, such as is already done
> for the CSV reader.
>
> > One particular item I could use feedback on is how this gets deployed.
> > In my mind this would be an ongoing test that is continuously running
> > against the previous nightly build.  Such a test would quickly consume
> > Apache's GHA minutes so I don't think GHA is an option.  Other free CI
> > options probably wouldn't have enough minutes for a continuous daily
> > test (e.g. ~40k minutes).
>
> I'm not sure what you have in mind.  You're intending to run this test
> 40k minutes per day?
>
> Regards
>
> Antoine.
>


Re: Pyarrow RecordBatchStreamWriter and dictionaries

2021-05-03 Thread Alessandro Molina
Hi Radu,

I was trying to reproduce the issue you described, but I was unable to.
Could you provide an example of how you built the Table?

I tried reproducing it with a table with the following schema

pa.schema([
    pa.field('nums', pa.list_(pa.int32())),
    pa.field('chars', pa.list_(pa.dictionary(pa.int32(), pa.string())))
])

but it serialized correctly.

On Fri, Apr 23, 2021 at 6:36 AM Radu Teodorescu
 wrote:

> Hi I am seeing a similar problem when serializing tables with lists of
> dictionary encoded elements: each resulting chunk is pointing to the first
> chunk’s original dictionary.
> Is this a known issue/limitation.
> I can follow with a repro otherwise.
> Thank you
> Radu
>
> > On Sep 28, 2020, at 1:26 PM, Wes McKinney  wrote:
> >
> > hi Al,
> >
> > It's definitely wrong. I confirmed the behavior is present on master.
> >
> > https://issues.apache.org/jira/browse/ARROW-10121
> >
> > I made this a blocker for the release.
> >
> > Thanks,
> > Wes
> >
> > On Mon, Sep 28, 2020 at 10:52 AM Al Taylor
> >  wrote:
> >>
> >> Hi,
> >>
> >> I've found that when I serialize two recordbatches which have a
> dictionary-encoded field, but different encoding dictionaries to a sequence
> of pybytes with a RecordBatchStreamWriter, then deserialize using
> pa.ipc.open_stream(), the dictionaries get jumbled. (or at least, on
> deserialization, the dictionary for the first RB is being reused for the
> second)
> >>
> >> MWE:
> >> ```
> >> import pyarrow as pa
> >> from io import BytesIO
> >>
> >> pa.__version__
> >>
> >> schema = pa.schema([pa.field('foo', pa.int32()), pa.field('bar',
> pa.dictionary(pa.int32(), pa.string()))] )
> >> r1 = pa.record_batch(
> >>     [
> >>         [1, 2, 3, 4, 5],
> >>         pa.array(["a", "b", "c", "d", "e"]).dictionary_encode()
> >>     ],
> >>     schema
> >> )
> >>
> >> r1.validate()
> >> r2 = pa.record_batch(
> >>     [
> >>         [1, 2, 3, 4, 5],
> >>         pa.array(["c", "c", "e", "f", "g"]).dictionary_encode()
> >>     ],
> >>     schema
> >> )
> >>
> >> r2.validate()
> >>
> >> assert r1.column(1).dictionary != r2.column(1).dictionary
> >>
> >>
> >> sink = pa.BufferOutputStream()
> >> writer = pa.RecordBatchStreamWriter(sink, schema)
> >>
> >> writer.write(r1)
> >> writer.write(r2)
> >>
> >> serialized = BytesIO(sink.getvalue().to_pybytes())
> >> stream = pa.ipc.open_stream(serialized)
> >>
> >> deserialized = []
> >>
> >> while True:
> >>     try:
> >>         deserialized.append(stream.read_next_batch())
> >>     except StopIteration:
> >>         break
> >>
> >> deserialized[0].column(1).to_pylist()
> >> deserialized[1].column(1).to_pylist()
> >> ```
> >> (The last line of the above prints out `['a', 'a', 'b', 'c', 'd']`.)
> >> This behaviour doesn't look right. I was wondering whether I'm simply
> >> not using the library correctly or whether this is a bug in pyarrow.
> >>
> >> Thanks,
> >>
> >> Al
>
>


Re: [Python] Who has been able to use PyArrow 4.0.0?

2021-04-28 Thread Alessandro Molina
Are you sure you haven't installed `libarrow` (the C++ one) manually,
independently of pyarrow?

Your traceback shows that the symbol was not found in
"/usr/local/lib/libarrow.400.dylib".

That smells like an independently installed libarrow: the libarrow provided
by pyarrow should live inside the Python environment (in my case, for
example, /usr/local/lib/python3.9/site-packages/pyarrow/libarrow.400.dylib).
I suspect your system-installed libarrow is taking precedence over the one
provided by pyarrow, and the two might not match.
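To check which libarrow a given environment actually ships, a small helper sketch (this is my own code, not part of pyarrow; the name and behaviour are hypothetical):

```python
import importlib.util
import os

def bundled_shared_libs(package="pyarrow", prefix="libarrow"):
    """Hypothetical helper: list shared libraries shipped inside a package.

    A pip-installed pyarrow wheel bundles its own libarrow.*.dylib / .so
    next to the extension modules. If this list is non-empty but the
    dynamic loader still resolves symbols from /usr/local/lib, a
    system-wide libarrow is shadowing the bundled one.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.origin:
        return None  # package not installed
    pkg_dir = os.path.dirname(spec.origin)
    return sorted(f for f in os.listdir(pkg_dir) if f.startswith(prefix))

print(bundled_shared_libs())
```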

On Wed, Apr 28, 2021 at 10:05 AM Ying Zhou  wrote:

> Hi,
>
> It turns out that I haven’t been able to use PyArrow 4.0.0 either in Conda
> environments or python venvs. PyArrow does install using pip. However this
> is what I get if I ever want to use it:
>
> >>> import pyarrow as pa
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/__init__.py",
> line 63, in 
> import pyarrow.lib as _lib
> ImportError:
> dlopen(/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/
> lib.cpython-38-darwin.so, 2): Symbol not found:
> __ZN5arrow10StopSource5tokenEv
>   Referenced from:
> /Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/
> lib.cpython-38-darwin.so
>   Expected in: /usr/local/lib/libarrow.400.dylib
>  in /Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/
> lib.cpython-38-darwin.so
> >>> pa
> Traceback (most recent call last):
>   File "", line 1, in 
> NameError: name 'pa' is not defined
>
> On the other hand a Conda installation is not even possible. Does anyone
> know what’s going on?
>
> Ying


Re: [DISCUSS] [Rust] Python-datafusion

2021-04-26 Thread Alessandro Molina
Would "incorporate" mean that the codebase is moved into the Arrow
repository, or is the plan to keep a separate repository for
datafusion-python under the Apache org?

On Sun, Apr 25, 2021 at 10:40 PM Daniël Heres  wrote:

> Hi Jorge,
>
> Awesome, I think this is a super valuable addition and makes DataFusion
> much more accessible / approachable for anyone wanting to experiment with
> DataFusion.
> Would be very cool to update it to the latest version and include it in the
> project.
>
> Best,
>
> Daniël
>
> On Sun, Apr 25, 2021, 22:32 Micah Kornfield  wrote:
>
> > Hi Jorge,
> > I think this would certainly be a valuable contribution.  How were you
> > thinking of hosting (which repo) / publishing it (maintaining a separate
> > wheel)?  Also, did you have thoughts on integration testing with pyarrow?
> >
> > Cheers,
> > Micah
> >
> > On Sun, Apr 25, 2021 at 9:13 AM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I filed a PR [1] to open up a discussion to incorporate
> > > python-datafusion [2] into the Apache Arrow project.
> > >
> > > Python-datafusion is a Python library [3] built on top of DataFusion
> > > that enables people to use DataFusion from Python. It leverages the C
> > > data interface for zero-copy exchange between DataFusion and pyarrow (a
> > > bunch of pointers is shared around).
> > >
> > > For example, it allows users to read a CSV from Rust, pass the arrays
> > > to a C++ kernel, continue the computation in Rust's kernels, and export
> > > to parquet using Rust (or C++ parquet, or whatever ^_^). It supports
> > > UDFs and UDAFs, in case someone wants to go crazy with Pyarrow, Pandas,
> > > numpy or tensorflow. =)
> > >
> > > Best,
> > > Jorge
> > >
> > > [1] https://github.com/apache/arrow-datafusion/pull/69
> > > [2] https://github.com/jorgecarleitao/datafusion-python
> > > [3] https://pypi.org/project/datafusion/
> > >
> >
>