Dear Arrow community, We are writing to share our thoughts about designing an Apache Arrow-native storage system leveraging Ceph’s extensibility mechanism as part of the SkyhookDM <http://skyhookdm.com> project and aim for a design that leverages Arrow as much as possible, both on the client API and within the storage system (as embedded library). We would really appreciate your feedback and guidance.
We are planning to write a Dataset API implementation on top of the Ceph RADOS API (see high-level design here <https://docs.google.com/document/d/1bd7FW2TNaIy-nE5OfI1rUYEcUM5F9h3t4excfpuJr3g>). In case you are not familiar with RADOS, librados is a low-level object storage interface that provides primitives to store single, non-striped objects in Ceph. It is the underlying layer for Ceph’s S3 implementation (RGW), as well as the block device (RBD) and POSIX file system interface (CephFS). One salient feature of librados is its “CLS” plugin mechanism, allowing developers to extend the base functionality of objects, in the true object-oriented programming sense of the word “object”. A base RADOS class can be extended (see here <https://makedist.com/files/cephalocon19-objclass-dev.pdf> for more details) so that object instantiations are augmented with storage-side compute capabilities. Our plan is to implement Arrow-native functionality on librados by creating a derived class, arrow::dataset::RadosFormat, from the arrow::dataset::FileFormat base class (see high-level diagram on the doc linked previously). Similarly to how CSV, Parquet and IPC formats are currently implemented, the new implementation (arrow/dataset/file_rados.* files) will consist of ScanTask, ScanTaskIterator and FileFormat implementations that will send/receive operations to Ceph storage nodes using librados. On the storage side, the cls_arrow extension will embed the C++ Arrow library in order to provide all the supported features of the Dataset API (filtering, projecting, etc.), using arrow::ipc to operate on record batches. Our main concern is that this new arrow::dataset::RadosFormat class will be deriving from the arrow::dataset::FileFormat class, which seems to raise a conceptual mismatch as there isn’t really a RADOS format but rather a formatting/serialization deferral that will be taking place, effectively introducing a new client-server layer in the Dataset API. We realize that this might be forcing the Dataset API to accommodate something it wasn’t originally designed to do. An alternative would be to extend arrow::fs; but, to the best of our knowledge, expressions, filters, etc. are not visible at this level, as this is only dealing with bytestreams. We looked at arrow::flight as well but this also seems inappropriate given that Ceph already provides the messaging capabilities for communicating with its own storage backend. We look forward to your comments Many thanks!