Hello Wes,

The concept sounds sensible and really useful.

The implementation will probably reside fully inside Arrow in the beginning, but do you plan to split it out into a separate package later on?

Cheers
Uwe

On 16.06.16 03:35, Wes McKinney wrote:
Hi folks,

I put some more thought into the "IO problem" as it relates to Arrow in
C++ (and transitively, Python) and wrote a short Google document with
my thoughts on it:

https://docs.google.com/document/d/16y-eyIgSVL8m5Q7Mmh-jIDRwlh-r0bYatYuDl4sbMIk/edit#

Feedback greatly appreciated! This will be on my critical path in the
near future, so I would like to know whether I'm approaching the problem
right and whether we are in alignment (then we can break things down into
a bunch of JIRAs).

(I can also post this doc directly to the mailing list; I thought the
initial discussion would be simpler in a GDoc.)

Thank you
Wes

On Wed, Jun 8, 2016 at 4:11 PM, Wes McKinney <wesmck...@gmail.com> wrote:
On Fri, Jun 3, 2016 at 10:16 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
Hi Wes,

At what level do you imagine the "opt-in" happening? Right now it
seems like it would be fairly straightforward at build time. However,
when we start packaging pyarrow for distribution, how do you imagine it
will work? (If [1] already answers this, please let me know; I've been
meaning to take a look at it.)

Where packaging and distribution are concerned, it'd be easiest to
provide non-picky users with a kitchen-sink build, but otherwise
developers could create precisely the build they want with CMake
flags, I guess. If certain libraries aren't found, then we wouldn't
fail the build by default, for example.
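
For illustration only, the opt-in could look roughly like the CMake sketch
below. The option names are hypothetical, nothing here is settled:

```cmake
# Rough sketch of opt-in components; option and target names are hypothetical.
option(ARROW_IO "Build the arrow_io library" ON)
option(ARROW_HDFS "Build the HDFS IO connector" OFF)

if(ARROW_HDFS)
  # libhdfs needs a JVM, so look for JNI but don't fail the build if missing.
  find_package(JNI)
  if(NOT JNI_FOUND)
    message(WARNING "JNI not found; skipping the HDFS connector")
    set(ARROW_HDFS OFF)
  endif()
endif()
```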

I need to grok the Python code base a little bit more to understand
the implications of the scope creep and the pain around taking a more
fine-grained component approach. But in general my experience has
been that packaging things together while maintaining clear internal
code boundaries for later separation is a good, pragmatic approach.

I'd propose adding an `arrow_io` leaf shared library where we can
create a small IO subsystem for reuse amongst different data
connectors. We can leave things fairly coarse-grained for the time
being and break them up later if it becomes onerous for other Arrow
developer-users.
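
To make the idea concrete, a first cut of the reader interface might look
roughly like the sketch below. These names are placeholders I'm making up
for discussion, not a settled API:

```cpp
// Hypothetical sketch of an arrow_io reader interface; names are placeholders.
#include <cstdint>

namespace arrow {
namespace io {

// Abstract random-access reader that local-file, HDFS, and blob-store
// connectors could implement.
class RandomAccessReader {
 public:
  virtual ~RandomAccessReader() = default;

  // Read up to `nbytes` starting at `position` into `out`; returns the
  // number of bytes actually read.
  virtual int64_t ReadAt(int64_t position, int64_t nbytes, uint8_t* out) = 0;

  // Total size of the file or object in bytes.
  virtual int64_t Size() const = 0;
};

}  // namespace io
}  // namespace arrow
```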

As a side note, hopefully we'll be able to reuse some existing
projects to do the heavy lifting for blob store integration. SFrame
is one option [2], and [3] might be worth investigating as well (both
appear to be Apache 2.0 licensed).
While requiring Java + $HADOOP_HOME for HDFS connectivity (wrapping
libhdfs) doesn't excite me that much, the prospect of bugs (or
secure-cluster issues) cropping up from a third-party HDFS client
without the ability to escalate problems to the Apache Hadoop team
worries me even more. There is a new official C++ HDFS client in the
works after the libhdfs3 patch was not accepted
(https://issues.apache.org/jira/browse/HDFS-8707), so this may be
worth pursuing once it matures.
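
For context, building on libhdfs would mean wrapping its JNI-backed C API,
roughly along the lines of the sketch below. This is only an illustration;
it assumes a JVM plus the Hadoop jars on the CLASSPATH at runtime, and the
path is made up. Error handling is elided:

```cpp
// Minimal sketch of reading from HDFS through the libhdfs JNI wrapper.
#include <fcntl.h>   // O_RDONLY
#include <cstdio>
#include <vector>

#include "hdfs.h"    // C API shipped with Hadoop (libhdfs)

int main() {
  // Connect to the namenode; "default" picks up fs.defaultFS from the config.
  hdfsFS fs = hdfsConnect("default", 0);

  // Open a (hypothetical) file for reading.
  hdfsFile file = hdfsOpenFile(fs, "/tmp/example.dat", O_RDONLY, 0, 0, 0);

  // Read one buffer's worth of data.
  std::vector<char> buffer(1 << 16);
  tSize bytes_read = hdfsRead(fs, file, buffer.data(), buffer.size());
  std::printf("read %d bytes\n", static_cast<int>(bytes_read));

  hdfsCloseFile(fs, file);
  hdfsDisconnect(fs);
  return 0;
}
```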

Thoughts on this welcome.

- Wes

Thanks,
-Micah

[1] https://github.com/apache/arrow/pull/79/files
[2] https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
[3] https://github.com/aws/aws-sdk-cpp


