Arrow Dataset API on Ceph

Ivo Jimenez Wed, 26 Aug 2020 22:02:50 -0700

Dear Arrow community,

We are writing to share our thoughts about designing an Apache Arrow-native
storage system leveraging Ceph’s extensibility mechanism as part of the
SkyhookDM <http://skyhookdm.com> project and aim for a design that
leverages Arrow as much as possible, both on the client API and within the
storage system (as embedded library). We would really appreciate your
feedback and guidance.


We are planning to write a Dataset API implementation on top of the Ceph
RADOS API (see high-level design here
<https://docs.google.com/document/d/1bd7FW2TNaIy-nE5OfI1rUYEcUM5F9h3t4excfpuJr3g>).
In case you are not familiar with RADOS, librados is a low-level object
storage interface that provides primitives to store single, non-striped
objects in Ceph. It is the underlying layer for Ceph’s S3 implementation
(RGW), as well as the block device (RBD) and POSIX file system interface
(CephFS).

One salient feature of librados is its “CLS” plugin mechanism, allowing
developers to extend the base functionality of objects, in the true
object-oriented programming sense of the word “object”. A base RADOS class
can be extended (see here
<https://makedist.com/files/cephalocon19-objclass-dev.pdf> for more
details) so that object instantiations are augmented with storage-side
compute capabilities.

Our plan is to implement Arrow-native functionality on librados by creating
a derived class, arrow::dataset::RadosFormat, from the
arrow::dataset::FileFormat base class (see high-level diagram on the doc
linked previously). Similarly to how CSV, Parquet and IPC formats are
currently implemented, the new implementation (arrow/dataset/file_rados.*
files) will consist of ScanTask, ScanTaskIterator and FileFormat
implementations that will send/receive operations to Ceph storage nodes
using librados. On the storage side, the cls_arrow extension will embed the
C++ Arrow library in order to provide all the supported features of the
Dataset API (filtering, projecting, etc.), using arrow::ipc to operate on
record batches.

Our main concern is that this new arrow::dataset::RadosFormat class will be
deriving from the arrow::dataset::FileFormat class, which seems to raise a
conceptual mismatch as there isn’t really a RADOS format but rather a
formatting/serialization deferral that will be taking place, effectively
introducing a new client-server layer in the Dataset API.

We realize that this might be forcing the Dataset API to accommodate
something it wasn’t originally designed to do. An alternative would be to
extend  arrow::fs; but, to the best of our knowledge, expressions, filters,
etc. are not visible at this level, as this is only dealing with
bytestreams. We looked at arrow::flight as well but this also seems
inappropriate given that Ceph already provides the messaging capabilities
for communicating with its own storage backend.

We look forward to your comments

Many thanks!

Arrow Dataset API on Ceph

Reply via email to