hi folks,

The Arrow implementations are generally built with a similar layering of components:
* Buffer objects supporting reference counting and zero-copy slicing
* Type metadata
* Vector/Array containers, and Record Batch containers
* IO interfaces for handling stream-like objects, files, memory maps, etc.
* IPC/messaging loaders and unloaders (for the streaming and file formats)

Outside of implementing and integration-testing more data types, we have reached a point where these components have been reasonably hardened for production use in both the Java and C++ libraries.

In C++ at least, there are additional applications that benefit from sharing some of these primitive components, particularly the Buffer abstraction (for zero-copy memory references, with reference counting via std::shared_ptr) and the type metadata.

In 0.3, we added the arrow::Tensor type, a traditional multidimensional array object, which uses arrow::Buffer, arrow::DataType (and its fixed-width subclasses), and some of the IO/IPC machinery for writing and reading with shared memory. It was nice that implementing this on top of the Arrow stack made things so easy; two small sketches follow below.
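To make the Buffer point concrete, here is a minimal sketch of zero-copy slicing, assuming the C++ API roughly as in the public headers (the non-owning arrow::Buffer constructor and arrow::SliceBuffer); exact signatures may vary between releases:

    // Minimal sketch: zero-copy slicing of an arrow::Buffer. The slice keeps
    // a shared_ptr reference to its parent, so the memory stays alive.
    #include <cstdint>
    #include <iostream>
    #include <memory>

    #include <arrow/buffer.h>

    int main() {
      // Wrap existing memory in a non-owning Buffer (no copy is made)
      static const uint8_t data[] = {1, 2, 3, 4, 5, 6, 7, 8};
      auto parent = std::make_shared<arrow::Buffer>(data, sizeof(data));

      // SliceBuffer returns a view into the same memory; the parent's
      // reference count (via std::shared_ptr) keeps it valid
      std::shared_ptr<arrow::Buffer> slice = arrow::SliceBuffer(parent, 2, 4);

      std::cout << "parent size: " << parent->size()
                << ", slice size: " << slice->size() << std::endl;
      // The slice's data pointer aliases the parent's memory: no copy
      std::cout << std::boolalpha
                << (slice->data() == parent->data() + 2) << std::endl;
      return 0;
    }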
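And a hedged sketch of how arrow::Tensor composes arrow::Buffer with the fixed-width type metadata; the three-argument constructor (type, data buffer, shape) follows arrow/tensor.h as I understand it, with omitted strides defaulting to row-major layout:

    // Sketch: a 2x3 row-major int64 Tensor built on an arrow::Buffer
    #include <cstdint>
    #include <iostream>
    #include <memory>
    #include <vector>

    #include <arrow/buffer.h>
    #include <arrow/tensor.h>
    #include <arrow/type.h>

    int main() {
      // Row-major 2x3 array of int64 values
      std::vector<int64_t> values = {1, 2, 3, 4, 5, 6};
      auto data = std::make_shared<arrow::Buffer>(
          reinterpret_cast<const uint8_t*>(values.data()),
          static_cast<int64_t>(values.size() * sizeof(int64_t)));

      // Reuse the fixed-width type metadata (arrow::int64()); omitting
      // strides means contiguous C-order layout
      arrow::Tensor tensor(arrow::int64(), data, {2, 3});

      std::cout << "ndim: " << tensor.ndim()
                << ", elements: " << tensor.size() << std::endl;
      return 0;
    }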
I don't see a particular issue with creating add-on libraries and additional data structures that make use of the IO/IPC tools, but we'll need to be mindful to explain that in expanding the libraries (the C/C++ libraries in particular), we are not expanding the definition of what it means to "implement Arrow" (i.e., we would not expect other implementations to handle data structures other than the primary columnar Arrow data). I think this is mainly a documentation and messaging question.

Any other thoughts on this? I'll be interested in the opinions of others.

Thanks,
Wes