Looks like a good idea. To take advantage of async IO, it would be nice to have a concept of chunking for large objects, and of "pipelining" in the sense of starting serialization/deserialization while the chunks are still being read or written. In some applications this can be very useful when dealing with large objects. For instance, with time series it would be possible to start a partial computation on the first semantically complete chunk while the remaining chunks are still being retrieved by the IO subsystem.
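
To make the pipelining idea a bit more concrete, here is a minimal sketch in Python using asyncio. It is only an illustration under stated assumptions, not a proposal for the actual Arrow API: all function names (`read_chunks`, `producer`, `consumer`, `run_pipeline`) are hypothetical, the "IO subsystem" is simulated with an in-memory buffer and `asyncio.sleep(0)`, and a "semantically complete chunk" is assumed to be a whole number of little-endian float64 values. The consumer deserializes each chunk and updates a running sum as soon as the chunk arrives, instead of waiting for the whole object.

```python
import asyncio
import struct

VALUE_SIZE = 8  # bytes per little-endian float64 (assumption for this sketch)


async def read_chunks(blob, chunk_bytes):
    """Yield fixed-size chunks; stands in for a real async read."""
    for offset in range(0, len(blob), chunk_bytes):
        await asyncio.sleep(0)  # give control back to the event loop
        yield blob[offset:offset + chunk_bytes]


async def producer(blob, chunk_bytes, queue):
    """Push chunks into the queue as they become available."""
    async for chunk in read_chunks(blob, chunk_bytes):
        await queue.put(chunk)
    await queue.put(None)  # sentinel: no more chunks


async def consumer(queue):
    """Deserialize each chunk on arrival and keep a running sum."""
    partial_sums = []
    total = 0.0
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        count = len(chunk) // VALUE_SIZE
        values = struct.unpack(f"<{count}d", chunk)
        total += sum(values)
        partial_sums.append(total)  # partial result after this chunk
    return partial_sums


async def pipeline(blob, chunk_bytes):
    # A small bounded queue lets reading run ahead of the computation
    # without buffering the whole object in memory.
    queue = asyncio.Queue(maxsize=2)
    reader = asyncio.create_task(producer(blob, chunk_bytes, queue))
    result = await consumer(queue)
    await reader
    return result


def run_pipeline(values, values_per_chunk):
    """Serialize values, then stream them back chunk by chunk."""
    blob = struct.pack(f"<{len(values)}d", *values)
    return asyncio.run(pipeline(blob, values_per_chunk * VALUE_SIZE))
```

For example, `run_pipeline([1.0, 2.0, 3.0, 4.0, 5.0], 2)` processes three chunks and returns one partial sum per chunk. The point is that chunk boundaries are aligned to whole values, so each chunk can be decoded and used on its own.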

--Pino

On Thu, Jun 16, 2016 at 3:36 AM Wes McKinney <[email protected]> wrote:

> Hi folks,
>
> I put some more thought into the "IO problem" as it relates Arrow in
> C++ (and transitively, Python) and wrote a short Google document with
> my thoughts on it:
>
> https://docs.google.com/document/d/16y-eyIgSVL8m5Q7Mmh-jIDRwlh-r0bYatYuDl4sbMIk/edit#
>
> Feedback greatly appreciated! This will be on my critical path in the
> near future, so I would like to know if I'm approaching the problem
> right, and we are in alignment (then can break things down into a
> bunch of JIRAs).
>
> (I can also post this doc directly to the mailing list, I thought the
> initial discussion would be simpler in a GDoc)
>
> Thank you
> Wes
>
> On Wed, Jun 8, 2016 at 4:11 PM, Wes McKinney <[email protected]> wrote:
> > On Fri, Jun 3, 2016 at 10:16 AM, Micah Kornfield <[email protected]> wrote:
> >> Hi Wes,
> >>
> >> At what level do you imagine, the "opt-in" happening. Right now it
> >> seems like it would be fairly straightforward at build time. However,
> >> when we start packaging pyarrow for distribution how do you imagine it
> >> will work? (If [1] already answers this, please let me know, I've been
> >> meaning to take a look at it).
> >>
> >
> > Where packaging and distribution is concerned, it'd be easiest to
> > provide non-picky users with a kitchen sink build, but otherwise
> > developers could create precisely the build they want with CMake
> > flags, I guess. If certain libraries aren't found then we wouldn't
> > fail the build by default, for example.
> >
> >> I need to grok the python code base a little bit more to understand
> >> the implications of the scope creep and the pain around taking a more
> >> fine-grained component approach. But in general my experience has
> >> been that packaging things together while maintaining clear internal
> >> code boundaries for later separation is a good pragmatic approach.
> >>
> >
> > I'd propose creating an `arrow_io` leaf shared library where we can
> > create a small IO subsystem for reuse amongst different data
> > connectors. We can leave things fairly coarse grained for the time
> > being and break things up later if it becomes onerous for other Arrow
> > developer-users.
> >
> >> As a side note, hopefully, we'll be able to re-use some existing
> >> projects to do the heavy lifting for blob store integration. SFrame
> >> is one option [2] and [3] might be worth investigating as well (both
> >> appear to be Apache 2.0 licensed).
> >
> > While requiring Java + $HADOOP_HOME for HDFS connectivity (wrapper
> > around libhdfs) doesn't excite me that much, the prospect of bugs (or
> > secure cluster issues) creeping up from a 3rd-party HDFS client
> > without the ability to escalate problems to the Apache Hadoop team
> > worries me even more. There is a new official C++ HDFS client in the
> > works after the libhdfs3 patch was not accepted
> > (https://issues.apache.org/jira/browse/HDFS-8707), so this may be
> > worth pursuing once it matures.
> >
> > Thoughts on this welcome.
> >
> > - Wes
> >
> >>
> >> Thanks,
> >> -Micah
> >>
> >> [1] https://github.com/apache/arrow/pull/79/files
> >> [2] https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
> >> [3] https://github.com/aws/aws-sdk-cpp
> >>
