Hello,
I would also embrace the scope creep. Since we will be dealing with a lot
of data, the cross-language I/O overhead will matter significantly for
performance in the end. We definitely have to be careful to make the
dependencies toggleable in the build system: you should be able to
easily get a build with all dependencies, but also be able to be very
selective about which ones are included in a build.
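To make "toggleable" concrete, here is a minimal sketch of what I have in
mind (the ARROW_HDFS macro is made up; the real name would be whatever the
build system defines when the user opts in to the dependency):

    #include <stdexcept>

    // Sketch only: ARROW_HDFS would be defined by the build system when the
    // user opts in to the HDFS dependency; builds without it never pull in
    // libhdfs at all.
    bool HdfsEnabled() {
    #ifdef ARROW_HDFS
      return true;
    #else
      return false;
    #endif
    }

    void RequireHdfs() {
      // Opted-out builds fail loudly at the point of use instead of forcing
      // the dependency onto every installation.
      if (!HdfsEnabled()) {
        throw std::runtime_error("this build was compiled without HDFS support");
      }
    }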
For HDFS and S3 support, I'm not sure whether arrow-cpp, pyarrow, or
parquet-cpp is the right place for the C++ implementation. In
arrow-cpp it would be the same scope creep as for PyArrow, but it could
already be used by C++ Arrow users; in parquet-cpp these IO classes
would also be helpful for non-Arrow users. For the moment I would
put the C++ implementations into arrow-cpp, as this keeps the scope
creep in Arrow itself while already providing value to C++ users and
to the other languages building on that layer.
Cheers,
Uwe
On 01.06.16 02:44, Wes McKinney wrote:
hi folks,
I wanted to bring up what is likely to become an issue very soon in
the context of our work to provide an Arrow-based Parquet interface
for Python Arrow users.
https://github.com/apache/arrow/pull/83
At the moment, parquet-cpp features an API that enables reading a file
from local disk (using C standard library calls):
https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L111
This is fine for now; however, we will quickly need to deal with a few
additional sources of data:
1) File-like Python objects (i.e. an object that has `seek`, `tell`,
and `read` methods)
2) Remote blob stores: HDFS and S3
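For concreteness, the existing local-disk path boils down to something like
the following (an illustrative sketch only, not the actual parquet-cpp code),
and nothing comparable exists yet for the two cases above:

    #include <cstdint>
    #include <cstdio>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Illustrative local-disk source in the spirit of a reader built on the
    // C standard library. It knows nothing about Python file-like objects
    // or remote blob stores.
    class LocalFileSource {
     public:
      explicit LocalFileSource(const std::string& path)
          : file_(std::fopen(path.c_str(), "rb")) {
        if (file_ == nullptr) {
          throw std::runtime_error("cannot open " + path);
        }
      }
      ~LocalFileSource() { if (file_ != nullptr) std::fclose(file_); }

      void Seek(int64_t position) {
        std::fseek(file_, static_cast<long>(position), SEEK_SET);
      }
      int64_t Tell() { return std::ftell(file_); }

      // Reads up to nbytes, returning the bytes actually read.
      std::vector<uint8_t> Read(int64_t nbytes) {
        std::vector<uint8_t> out(static_cast<size_t>(nbytes));
        size_t got = std::fread(out.data(), 1, out.size(), file_);
        out.resize(got);
        return out;
      }

     private:
      std::FILE* file_;
    };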
Implementing #1 at present is a routine exercise in using the Python C
API. #2 is less so -- one approach that has been taken by others is to
create separate Python file-like wrapper classes for remote storage so
that it behaves like a local file. This has multiple downsides (see the
sketch below):
- read/seek/tell calls must cross up into the Python interpreter and
back down into the C++ layer
- bytes buffered by read calls get copied into Python bytes objects
(see PyBytes_FromStringAndSize)
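Concretely, each read through such a wrapper ends up doing roughly the
following at the C level (a sketch with made-up function names, though these
are the actual Python C API calls involved):

    #include <Python.h>
    #include <vector>

    // Sketch: what a C++ reader must do for every read against a Python
    // file-like object. Note the interpreter round trip and the extra copy.
    // Error handling is elided for brevity.
    std::vector<char> ReadFromPyFile(PyObject* py_file, Py_ssize_t nbytes) {
      PyGILState_STATE gil = PyGILState_Ensure();  // must hold the GIL

      // Crosses up into the interpreter: file.read(nbytes) returns a new
      // bytes object whose internal buffer is owned by Python.
      PyObject* result = PyObject_CallMethod(py_file, "read", "n", nbytes);

      std::vector<char> out;
      if (result != nullptr) {
        char* data = nullptr;
        Py_ssize_t length = 0;
        if (PyBytes_AsStringAndSize(result, &data, &length) == 0) {
          // Second copy: bytes already buffered on the Python side are
          // copied again into a buffer the C++ consumer can use.
          out.assign(data, data + length);
        }
        Py_DECREF(result);
      }
      PyGILState_Release(gil);
      return out;
    }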
Beyond the GIL / concurrency issues, there's an efficiency loss that
can be remedied by instead implementing:
- A direct C/C++-level interface (independent of the Python interpreter)
to remote blob stores. These can then buffer bytes directly in the
form requested by other C++ consumer libraries (like parquet-cpp)
- A Python file-like interface on top, so that users can still get
at the bytes in pure Python if they want (for example, some functions,
like pandas.read_csv, primarily deal with file-like things)
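In code terms, something in the spirit of the following (all names here are
hypothetical; the point is that reads happen entirely in C++ and land directly
in a buffer the consumer supplies):

    #include <cstdint>
    #include <memory>
    #include <string>

    // Hypothetical common interface: every source (local file, HDFS, S3)
    // reads directly into a caller-supplied buffer, with no Python objects
    // or interpreter involvement on the hot path.
    class RandomAccessSource {
     public:
      virtual ~RandomAccessSource() = default;
      virtual void Seek(int64_t position) = 0;
      virtual int64_t Tell() const = 0;
      virtual int64_t Size() const = 0;
      // Returns the number of bytes actually read into out.
      virtual int64_t Read(int64_t nbytes, uint8_t* out) = 0;
    };

    // Hypothetical factory functions provided by the blob-store backends;
    // C++ consumers such as parquet-cpp would only see the interface above.
    std::unique_ptr<RandomAccessSource> OpenHdfs(const std::string& host, int port,
                                                 const std::string& path);
    std::unique_ptr<RandomAccessSource> OpenS3(const std::string& bucket,
                                               const std::string& key);

A thin file-like wrapper exposing read/seek/tell (e.g. via Cython) can then
forward straight to these objects, so pure-Python consumers still work without
the C++ layer ever round-tripping through the interpreter.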
This is a clearly superior solution, and has been notably pursued in
recent times by Dato's SFrame library (BSD 3-clause):
https://github.com/dato-code/SFrame/tree/master/oss_src/fileio
The problem, however, is the inevitable scope creep for the PyArrow
Python package. Unlike in some other programming languages, Python
programmers face a substantial development complexity burden if they
choose to break libraries containing C extensions into smaller
components, as the libraries must define "internal" C APIs for
connecting to each other. A notable example is NumPy
(http://docs.scipy.org/doc/numpy-1.10.1/reference/c-api.html), whose C
API is already being used in PyArrow.
I've been thinking about this problem for several weeks, and my net
recommendation is that we embrace the scope creep in PyArrow (as long
as we try to make optional features, e.g. low-level S3 / libhdfs
integration, "opt-in" versus required for all users). I'd like to hear
from some others, though (e.g. Uwe, Micah, etc.).
thanks,
Wes