hi folks,

The Arrow implementations are generally built with a similar layering of components:
* Buffer objects supporting reference counting and zero-copy slicing
* Type metadata
* Vector/Array containers, and Record Batch containers
* IO interfaces for handling stream-like objects, files, memory maps, etc.
* IPC/messaging loaders and unloaders (for the streaming and file formats)

Outside of implementing and integration-testing more data types, we have reached a point where these components have been reasonably hardened for production use in both the Java and C++ libraries.

In C++ at least, there are additional applications that benefit from sharing some of these primitive components, particularly the Buffer abstraction (for zero-copy memory references, with reference counting via std::shared_ptr) and the type metadata.

In 0.3, we added the arrow::Tensor type, a traditional multidimensional array object, which uses arrow::Buffer, arrow::DataType (and its fixed-width subclasses), and some of the IO/IPC machinery for writing and reading with shared memory. It was nice that implementing this on top of the Arrow stack made things so easy; two small sketches follow below.
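To make the Buffer point concrete, here is a minimal sketch of zero-copy slicing, assuming the C++ API roughly as in the public headers (the non-owning arrow::Buffer constructor and arrow::SliceBuffer); exact signatures may vary between releases:

    // Minimal sketch: zero-copy slicing of an arrow::Buffer. The slice keeps
    // a shared_ptr reference to its parent, so the memory stays alive.
    #include <cstdint>
    #include <iostream>
    #include <memory>

    #include <arrow/buffer.h>

    int main() {
      // Wrap existing memory in a non-owning Buffer (no copy is made)
      static const uint8_t data[] = {1, 2, 3, 4, 5, 6, 7, 8};
      auto parent = std::make_shared<arrow::Buffer>(data, sizeof(data));

      // SliceBuffer returns a view into the same memory; the parent's
      // reference count (via std::shared_ptr) keeps it valid
      std::shared_ptr<arrow::Buffer> slice = arrow::SliceBuffer(parent, 2, 4);

      std::cout << "parent size: " << parent->size()
                << ", slice size: " << slice->size() << std::endl;
      // The slice's data pointer aliases the parent's memory: no copy
      std::cout << std::boolalpha
                << (slice->data() == parent->data() + 2) << std::endl;
      return 0;
    }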
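And a hedged sketch of how arrow::Tensor composes arrow::Buffer with the fixed-width type metadata; the three-argument constructor (type, data buffer, shape) follows arrow/tensor.h as I understand it, with omitted strides defaulting to row-major layout:

    // Sketch: a 2x3 row-major int64 Tensor built on an arrow::Buffer
    #include <cstdint>
    #include <iostream>
    #include <memory>
    #include <vector>

    #include <arrow/buffer.h>
    #include <arrow/tensor.h>
    #include <arrow/type.h>

    int main() {
      // Row-major 2x3 array of int64 values
      std::vector<int64_t> values = {1, 2, 3, 4, 5, 6};
      auto data = std::make_shared<arrow::Buffer>(
          reinterpret_cast<const uint8_t*>(values.data()),
          static_cast<int64_t>(values.size() * sizeof(int64_t)));

      // Reuse the fixed-width type metadata (arrow::int64()); omitting
      // strides means contiguous C-order layout
      arrow::Tensor tensor(arrow::int64(), data, {2, 3});

      std::cout << "ndim: " << tensor.ndim()
                << ", elements: " << tensor.size() << std::endl;
      return 0;
    }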
I don't see a particular issue with creating add-on libraries and additional data structures that make use of the IO/IPC tools, but we'll need to be mindful to explain that in expanding the libraries (the C/C++ libraries in particular), we are not expanding the definition of what it means to "implement Arrow" (i.e., we would not expect other implementations to handle data structures other than the primary columnar Arrow data). I think this is mainly a documentation and messaging question.

Any other thoughts on this? I'll be interested in the opinions of others.

Thanks,
Wes