JayjeetAtGithub opened a new pull request #10431:
URL: https://github.com/apache/arrow/pull/10431


   The implementation includes a new `RadosParquetFileFormat` class that 
inherits from `ParquetFileFormat` and defers the evaluation of scan 
operations on a Parquet dataset to a RADOS storage backend. The new file 
format plugs into the `FileSystemDataset` API, converts filenames to object IDs 
using filesystem metadata, and uses the 
[librados](https://docs.ceph.com/en/latest/rados/api/librados-intro/) C++ 
library to execute storage-side functions that scan the files on the 
[Ceph](https://ceph.io) storage nodes (OSDs) using Arrow libraries. We ship 
unit and integration tests with the implementation; the tests run against a 
single-node Ceph cluster.
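
   As a rough illustration of the filename-to-object-ID step, here is a minimal 
sketch in Python. It assumes the object ID is derived from the file's inode 
number (hex-encoded) followed by an 8-digit hex stripe index, which is how 
CephFS conventionally names data objects; that the PR uses exactly this scheme 
is an assumption, and `object_id_for_file` is a hypothetical helper, not part 
of the Arrow or librados API.

```python
import os
import tempfile


def object_id_for_file(path: str, stripe_index: int = 0) -> str:
    """Hypothetical helper: build an object ID from filesystem metadata.

    Assumption: the ID is "<inode in hex>.<stripe index as 8 hex digits>".
    """
    inode = os.stat(path).st_ino  # filesystem metadata for the file
    return f"{inode:x}.{stripe_index:08x}"


# Usage: map a (temporary) file name to the object ID it would resolve to.
with tempfile.NamedTemporaryFile(suffix=".parquet") as f:
    print(f.name, "->", object_id_for_file(f.name))
```

On a regular local filesystem this only demonstrates the string format; the 
resulting ID names a real RADOS object only when the path lives on CephFS.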
   
   The storage-side code is implemented as a RADOS CLS (object storage class) 
using [Ceph's Object Class 
SDK](https://docs.ceph.com/en/latest/rados/api/objclass-sdk/#:~:text=Ceph%20can%20be%20extended%20by,object%20classes%20within%20the%20tree.).
 The code lives in `cpp/src/arrow/adapters/arrow-rados-cls` and is expected to 
be deployed on the storage nodes (Ceph's OSDs) before operating on tables via 
the `RadosParquetFileFormat` implementation. This PR includes a CMake 
configuration for optionally building this library (the `ARROW_CLS` CMake 
option). We have also added Python bindings for the C++ implementation, along 
with integration tests for the bindings.
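
   To make the client/storage split concrete, the following is a toy model in 
Python of deferring a scan to the storage side. It is not the PR's code: the 
in-memory `OBJECT_STORE`, `storage_side_scan`, and `client_scan` are all 
hypothetical stand-ins for the RADOS objects, the CLS method running on an OSD, 
and the client-side file format, respectively.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]

# Hypothetical stand-in for the object store: object ID -> rows in that object.
OBJECT_STORE: Dict[str, List[Row]] = {
    "10000000000.00000000": [
        {"id": 1, "value": 10},
        {"id": 2, "value": 25},
        {"id": 3, "value": 40},
    ],
}


def storage_side_scan(
    oid: str,
    columns: List[str],
    predicate: Optional[Callable[[Row], bool]] = None,
) -> List[Row]:
    """Models the storage-side function: filter and project next to the data."""
    rows = OBJECT_STORE[oid]
    if predicate is not None:
        rows = [row for row in rows if predicate(row)]
    return [{col: row[col] for col in columns} for row in rows]


def client_scan(
    oids: List[str],
    columns: List[str],
    predicate: Optional[Callable[[Row], bool]] = None,
) -> List[Row]:
    """Models the client: send the scan request to each object, merge results."""
    result: List[Row] = []
    for oid in oids:
        result.extend(storage_side_scan(oid, columns, predicate))
    return result


result = client_scan(["10000000000.00000000"], ["id"], lambda r: r["value"] > 20)
print(result)  # [{'id': 2}, {'id': 3}]
```

The point of the sketch is that only the filtered, projected rows cross the 
boundary between `storage_side_scan` and `client_scan`, rather than the whole 
object.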


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
