JayjeetAtGithub opened a new pull request #10431: URL: https://github.com/apache/arrow/pull/10431
The implementation adds a new `RadosParquetFileFormat` class that inherits from `ParquetFileFormat` and defers the evaluation of scan operations on a Parquet dataset to a RADOS storage backend. The new file format plugs into the `FileSystemDataset` API, converts filenames to object IDs using FS metadata, and uses the [librados](https://docs.ceph.com/en/latest/rados/api/librados-intro/) C++ library to execute storage-side functions that scan the files on the [Ceph](https://ceph.io) storage nodes (OSDs) using Arrow libraries. We ship unit and integration tests with our implementation; the tests run against a single-node Ceph cluster.

The storage-side code is implemented as a RADOS CLS (object storage class) using [Ceph's Object Class SDK](https://docs.ceph.com/en/latest/rados/api/objclass-sdk/). The code lives in `cpp/src/arrow/adapters/arrow-rados-cls` and is expected to be deployed on the storage nodes (Ceph's OSDs) before operating on tables via the `RadosParquetFileFormat` implementation. This PR includes a CMake configuration for building this library if desired (`ARROW_CLS` CMake option). We have also added Python bindings for our C++ implementation, along with integration tests for them.
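To make the flow described above more concrete, here is a minimal, self-contained sketch of the two ideas: deriving an object ID from filesystem metadata, and running a scan next to the data so that only the result is returned. It is a toy model, not the code in this PR. The object naming scheme (hex inode plus an 8-hex-digit object index, in the style of CephFS data objects) is an assumption, since the description does not spell out the mapping. `ToyObjectStore`, `scan_op`, and the request layout are hypothetical stand-ins for librados and the CLS method, and plain Python dicts stand in for Parquet data.

```python
import os
import tempfile


def file_to_object_id(path, object_index=0):
    """Build an object name of the form <inode hex>.<index as 8 hex digits>.

    Assumption: CephFS-style object naming; the PR's actual mapping may differ.
    """
    inode = os.stat(path).st_ino  # filesystem metadata for the file
    return f"{inode:x}.{object_index:08x}"


class ToyObjectStore:
    """In-memory stand-in for a storage node. This is not librados."""

    def __init__(self):
        self.objects = {}  # object ID -> list of row dicts
        self.methods = {}  # (class name, method name) -> callable

    def register(self, cls, method, fn):
        self.methods[(cls, method)] = fn

    def execute(self, oid, cls, method, request):
        # The method runs where the object lives; only its result is returned.
        return self.methods[(cls, method)](self.objects[oid], request)


def scan_op(rows, request):
    """Hypothetical storage-side scan: filter rows, then project columns."""
    column, minimum = request["filter"]
    return [
        {name: row[name] for name in request["columns"]}
        for row in rows
        if row[column] >= minimum
    ]


def demo():
    store = ToyObjectStore()
    store.register("toy_cls", "scan_op", scan_op)
    with tempfile.NamedTemporaryFile() as f:
        oid = file_to_object_id(f.name)
        # Pretend the file's contents are stored as a single object.
        store.objects[oid] = [{"id": 1, "x": 5}, {"id": 2, "x": 50}]
        request = {"columns": ["id"], "filter": ("x", 10)}
        result = store.execute(oid, "toy_cls", "scan_op", request)
    return oid, result


if __name__ == "__main__":
    print(demo())
```

In this model the client never reads the object itself: it sends a small request (projection and filter) and receives only the matching rows, which is the point of pushing the scan down to the storage nodes.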
