Hello,
I would also embrace the scope creep. Since we will be dealing with a lot
of data, the cross-language I/O overhead will matter significantly for
performance in the end. We definitely have to be careful to make the
dependencies toggleable in the build system: you should be able to
easily get a build with all dependencies, but also be able to be very
selective about which ones are included in a build.
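To make "toggleable" concrete, here is a minimal sketch of what I have in
mind (the ARROW_HDFS macro is made up; the real name would be whatever the
build system defines when the user opts in to the dependency):

    #include <stdexcept>

    // Sketch only: ARROW_HDFS would be defined by the build system when the
    // user opts in to the HDFS dependency; builds without it never pull in
    // libhdfs at all.
    bool HdfsEnabled() {
    #ifdef ARROW_HDFS
      return true;
    #else
      return false;
    #endif
    }

    void RequireHdfs() {
      // Opted-out builds fail loudly at the point of use instead of forcing
      // the dependency onto every installation.
      if (!HdfsEnabled()) {
        throw std::runtime_error("this build was compiled without HDFS support");
      }
    }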
For HDFS and S3 support, I'm not sure whether arrow-cpp, pyarrow, or
parquet-cpp is the right place for the C++ implementation. In
arrow-cpp it would be the same scope creep as for PyArrow, but it could
already be used by C++ Arrow users; in parquet-cpp these IO classes
would also be helpful for non-Arrow users. For the moment I would
put the C++ implementations into arrow-cpp, as this keeps the scope
creep in Arrow itself while already providing value to C++ users and
to the other languages building on that layer.
Cheers,
Uwe
On 01.06.16 02:44, Wes McKinney wrote:
hi folks,
I wanted to bring up what is likely to become an issue very soon in
the context of our work to provide an Arrow-based Parquet interface
for Python Arrow users.
https://github.com/apache/arrow/pull/83
At the moment, parquet-cpp features an API that enables reading a file
from local disk (using C standard library calls):
https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L111
This is fine for now; however, we will quickly need to deal with a few
additional sources of data:
1) File-like Python objects (i.e. an object that has `seek`, `tell`,
and `read` methods)
2) Remote blob stores: HDFS and S3
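For concreteness, the existing local-disk path boils down to something like
the following (an illustrative sketch only, not the actual parquet-cpp code),
and nothing comparable exists yet for the two cases above:

    #include <cstdint>
    #include <cstdio>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Illustrative local-disk source in the spirit of a reader built on the
    // C standard library. It knows nothing about Python file-like objects
    // or remote blob stores.
    class LocalFileSource {
     public:
      explicit LocalFileSource(const std::string& path)
          : file_(std::fopen(path.c_str(), "rb")) {
        if (file_ == nullptr) {
          throw std::runtime_error("cannot open " + path);
        }
      }
      ~LocalFileSource() { if (file_ != nullptr) std::fclose(file_); }

      void Seek(int64_t position) {
        std::fseek(file_, static_cast<long>(position), SEEK_SET);
      }
      int64_t Tell() { return std::ftell(file_); }

      // Reads up to nbytes, returning the bytes actually read.
      std::vector<uint8_t> Read(int64_t nbytes) {
        std::vector<uint8_t> out(static_cast<size_t>(nbytes));
        size_t got = std::fread(out.data(), 1, out.size(), file_);
        out.resize(got);
        return out;
      }

     private:
      std::FILE* file_;
    };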
Implementing #1 at present is a routine exercise in using the Python C
API. #2 is less so -- one approach that has been taken by others is to
create separate Python file-like wrapper classes for remote storage so
that it behaves like a local file. This has multiple downsides (see the
sketch below):
- read/seek/tell calls must cross up into the Python interpreter and
back down into the C++ layer
- bytes buffered by read calls get copied into Python bytes objects
(see PyBytes_FromStringAndSize)
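Concretely, each read through such a wrapper ends up doing roughly the
following at the C level (a sketch with made-up function names, though these
are the actual Python C API calls involved):

    #include <Python.h>
    #include <vector>

    // Sketch: what a C++ reader must do for every read against a Python
    // file-like object. Note the interpreter round trip and the extra copy.
    // Error handling is elided for brevity.
    std::vector<char> ReadFromPyFile(PyObject* py_file, Py_ssize_t nbytes) {
      PyGILState_STATE gil = PyGILState_Ensure();  // must hold the GIL

      // Crosses up into the interpreter: file.read(nbytes) returns a new
      // bytes object whose internal buffer is owned by Python.
      PyObject* result = PyObject_CallMethod(py_file, "read", "n", nbytes);

      std::vector<char> out;
      if (result != nullptr) {
        char* data = nullptr;
        Py_ssize_t length = 0;
        if (PyBytes_AsStringAndSize(result, &data, &length) == 0) {
          // Second copy: bytes already buffered on the Python side are
          // copied again into a buffer the C++ consumer can use.
          out.assign(data, data + length);
        }
        Py_DECREF(result);
      }
      PyGILState_Release(gil);
      return out;
    }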
Beyond the GIL / concurrency issues, there's an efficiency loss that
can be remedied by instead implementing:
- A direct C/C++-level interface (independent of the Python interpreter)
to remote blob stores. These can then buffer bytes directly in the
form requested by other C++ consumer libraries (like parquet-cpp)
- A Python file-like interface on top, so that users can still get
at the bytes in pure Python if they want (for example, some functions,
like pandas.read_csv, primarily deal with file-like things)
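In code terms, something in the spirit of the following (all names here are
hypothetical; the point is that reads happen entirely in C++ and land directly
in a buffer the consumer supplies):

    #include <cstdint>
    #include <memory>
    #include <string>

    // Hypothetical common interface: every source (local file, HDFS, S3)
    // reads directly into a caller-supplied buffer, with no Python objects
    // or interpreter involvement on the hot path.
    class RandomAccessSource {
     public:
      virtual ~RandomAccessSource() = default;
      virtual void Seek(int64_t position) = 0;
      virtual int64_t Tell() const = 0;
      virtual int64_t Size() const = 0;
      // Returns the number of bytes actually read into out.
      virtual int64_t Read(int64_t nbytes, uint8_t* out) = 0;
    };

    // Hypothetical factory functions provided by the blob-store backends;
    // C++ consumers such as parquet-cpp would only see the interface above.
    std::unique_ptr<RandomAccessSource> OpenHdfs(const std::string& host, int port,
                                                 const std::string& path);
    std::unique_ptr<RandomAccessSource> OpenS3(const std::string& bucket,
                                               const std::string& key);

A thin file-like wrapper exposing read/seek/tell (e.g. via Cython) can then
forward straight to these objects, so pure-Python consumers still work without
the C++ layer ever round-tripping through the interpreter.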
This is a clearly superior solution, and has been notably pursued in
recent times by Dato's SFrame library (BSD 3-clause):
https://github.com/dato-code/SFrame/tree/master/oss_src/fileio
The problem, however, is the inevitable scope creep for the PyArrow
Python package. Unlike in some other programming languages, Python
programmers face a substantial development complexity burden if they
choose to break libraries containing C extensions into smaller
components, as the libraries must define "internal" C APIs for
connecting to each other. A notable example is NumPy
(http://docs.scipy.org/doc/numpy-1.10.1/reference/c-api.html), whose C
API is already being used in PyArrow.
I've been thinking about this problem for several weeks, and my net
recommendation is that we embrace the scope creep in PyArrow (as long
as we try to make optional features, e.g. low-level S3 / libhdfs
integration, "opt-in" versus required for all users). I'd like to hear
from some others, though (e.g. Uwe, Micah, etc.).
thanks,
Wes