Re: IO considerations for PyArrow

pino patera Thu, 16 Jun 2016 02:12:50 -0700

Looks like a good idea. To take advantage of async IO, it would be nice to
have a concept of chunking for large objects, and of "pipelining" in the
sense of starting serialization/deserialization while the chunks are still
being read or written. In some applications this can be very useful when
dealing with large objects. With timeseries, for instance, it would be
possible to start a partial computation on the first semantically complete
chunk while the remaining chunks are retrieved by the IO subsystem.
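
To make the idea concrete, here is a rough sketch in plain Python. It is
not Arrow code: the record layout (one little-endian float64 per sample)
and the names read_chunks and running_sums are made up for illustration,
and a thread plus a bounded queue stands in for a real async IO layer. It
also assumes every read returns whole records, which holds for an
in-memory buffer but would need extra handling for short reads on a socket.

```python
import io
import queue
import struct
import threading

RECORD = struct.Struct("<d")   # assumed layout: one little-endian float64 per sample
_DONE = object()               # sentinel marking the end of the stream


def read_chunks(source, records_per_chunk, out):
    """Producer: read chunks of whole records from `source` and queue them."""
    size = RECORD.size * records_per_chunk
    try:
        while True:
            buf = source.read(size)
            if not buf:
                break
            out.put(buf)
    finally:
        out.put(_DONE)


def running_sums(source, records_per_chunk=4, depth=2):
    """Consumer: yield the running sum after each chunk has been decoded."""
    # Bounded queue, so the reader cannot run arbitrarily far ahead.
    chunks = queue.Queue(maxsize=depth)
    reader = threading.Thread(
        target=read_chunks, args=(source, records_per_chunk, chunks)
    )
    reader.start()
    total = 0.0
    while True:
        buf = chunks.get()
        if buf is _DONE:
            break
        # Decode this chunk while the reader thread fetches the next one.
        total += sum(value for (value,) in RECORD.iter_unpack(buf))
        yield total
    reader.join()


# Ten samples (0.0 .. 9.0) serialized back to back, consumed chunk by chunk.
data = b"".join(RECORD.pack(float(i)) for i in range(10))
partial_results = list(running_sums(io.BytesIO(data)))
```

Each value in partial_results is available as soon as its chunk has
arrived, which is the "partial computation on the first complete chunk"
behaviour I have in mind.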


--Pino

On Thu, Jun 16, 2016 at 3:36 AM Wes McKinney <[email protected]> wrote:

> Hi folks,
>
> I put some more thought into the "IO problem" as it relates to Arrow in
> C++ (and transitively, Python) and wrote a short Google document with
> my thoughts on it:
>
>
> https://docs.google.com/document/d/16y-eyIgSVL8m5Q7Mmh-jIDRwlh-r0bYatYuDl4sbMIk/edit#
>
> Feedback greatly appreciated! This will be on my critical path in the
> near future, so I would like to know whether I'm approaching the problem
> right and whether we are in alignment (we can then break things down into
> a bunch of JIRAs).
>
> (I can also post this doc directly to the mailing list, I thought the
> initial discussion would be simpler in a GDoc)
>
> Thank you
> Wes
>
> On Wed, Jun 8, 2016 at 4:11 PM, Wes McKinney <[email protected]> wrote:
> > On Fri, Jun 3, 2016 at 10:16 AM, Micah Kornfield <[email protected]> wrote:
> >> Hi Wes,
> >>
> >> At what level do you imagine the "opt-in" happening?  Right now it
> >> seems like it would be fairly straightforward at build time.  However,
> >> when we start packaging pyarrow for distribution how do you imagine it
> >> will work? (If [1] already answers this, please let me know, I've been
> >> meaning to take a look at it).
> >>
> >
> > Where packaging and distribution are concerned, it'd be easiest to
> > provide non-picky users with a kitchen sink build, but otherwise
> > developers could create precisely the build they want with CMake
> > flags, I guess. If certain libraries aren't found then we wouldn't
> > fail the build by default, for example.
> >
> >> I need to grok the python code base a little bit more to understand
> >> the implications of the scope creep and the pain around taking a more
> >> fine-grained component approach.  But in general my experience has
> >> been that packaging things together while maintaining clear internal
> >> code boundaries for later separation is a good pragmatic approach.
> >>
> >
> > I'd propose creating an `arrow_io` leaf shared library where we can
> > create a small IO subsystem for reuse amongst different data
> > connectors. We can leave things fairly coarse grained for the time
> > being and break things up later if it becomes onerous for other Arrow
> > developer-users.
> >
> >> As a side note, hopefully, we'll be able to re-use some existing
> >> projects to do the heavy lifting for blob store integration.  SFrame
> >> is one option [2] and [3] might be worth investigating as well (both
> >> appear to be Apache 2.0 licensed).
> >
> > While requiring Java + $HADOOP_HOME for HDFS connectivity (wrapper
> > around libhdfs) doesn't excite me that much, the prospect of bugs (or
> > secure cluster issues) creeping up from a 3rd-party HDFS client
> > without the ability to escalate problems to the Apache Hadoop team
> > worries me even more. There is a new official C++ HDFS client in the
> > works after the libhdfs3 patch was not accepted
> > (https://issues.apache.org/jira/browse/HDFS-8707), so this may be
> > worth pursuing once it matures.
> >
> > Thoughts on this welcome.
> >
> > - Wes
> >
> >>
> >> Thanks,
> >> -Micah
> >>
> >> [1] https://github.com/apache/arrow/pull/79/files
> >> [2] https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
> >> [3] https://github.com/aws/aws-sdk-cpp
> >>
> >>
>
