jorgecarleitao opened a new pull request #8401: URL: https://github.com/apache/arrow/pull/8401
This PR is a proposal to add support for the [C data interface](https://arrow.apache.org/docs/format/CDataInterface.html) by implementing the functionality needed to both consume and produce structs that follow its ABI and lifetime rules.

This is for now limited to primitive types, but it is easily generalized to all types whose data is encapsulated in `ArrayData` (things with buffers and child data). For types where this is not the case (such as `StructArray`, `DictionaryArray` and `ListArray`, where data also resides in their specific struct implementations instead of in `ArrayData`), I believe that more work is needed. IMO we should strive to have all data in `ArrayData`, as it makes it significantly easier to export arrays via the C Data Interface, and to understand what they physically contain.

Some design choices:

* import and export do not care about the type of the data that is in memory (previously `BufferData`, now `Bytes`) - they only care about how `ArrayData` is converted to and from the C data interface.
* import wraps incoming pointers in a struct behind an `Arc`, so that we refcount them thread-safely and can share them between buffers, arrays, etc.
* `export` places `Buffer`s in `private_data` for bookkeeping and releases them when the consumer releases the array via `release`.

I do not expect this PR to be easy to review, as it touches sensitive (aka `unsafe`) code. However, based on the tests I have done so far, I am sufficiently happy to PR it. I tried to comment as much as possible, and I will continue to do so in the following commits.

This PR has three main parts:

1. Addition of an `ffi` module that contains the import and export functionality
2. Some helpers to import and export an Array via the C Data Interface
3. A crate to test this against Python/C++'s API

It also does a small refactor of `BufferData`, renaming it to `Bytes` (motivated by the popular `bytes` crate), and moving it to a separate file.

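
To make the import and export design choices above concrete, here is a minimal sketch of the ownership flow, not the code in this PR. `FfiArray` is a hypothetical, heavily simplified stand-in for the C Data Interface's `ArrowArray` struct (one buffer, no children, no dictionary), and `export`, `import`, `Imported` and `release_vec` are illustrative names only. It assumes the exported memory is never mutated after export and that the release callback may be called from any thread.

```rust
use std::os::raw::c_void;
use std::sync::Arc;

/// Hypothetical, simplified stand-in for the C `ArrowArray` struct.
/// The real struct has more fields (null count, offset, children, ...).
#[repr(C)]
pub struct FfiArray {
    pub length: i64,
    pub buffer: *const u8,
    pub release: Option<unsafe extern "C" fn(*mut FfiArray)>,
    pub private_data: *mut c_void,
}

/// Release callback installed by the producer: frees the `Vec` kept in
/// `private_data`, then marks the struct as released by clearing `release`.
unsafe extern "C" fn release_vec(array: *mut FfiArray) {
    unsafe {
        let array = &mut *array;
        // Rebuild the Box so that the Vec is dropped here.
        drop(Box::from_raw(array.private_data as *mut Vec<u8>));
        array.buffer = std::ptr::null();
        array.private_data = std::ptr::null_mut();
        array.release = None;
    }
}

/// Producer side: keep the owned data alive in `private_data` until the
/// consumer calls `release`. If nobody calls it, the memory is leaked.
pub fn export(data: Vec<u8>) -> FfiArray {
    let boxed = Box::new(data);
    let length = boxed.len() as i64;
    let buffer = boxed.as_ptr();
    FfiArray {
        length,
        buffer,
        release: Some(release_vec),
        private_data: Box::into_raw(boxed) as *mut c_void,
    }
}

/// Consumer side: owns the imported struct and calls `release` exactly
/// once, when the last reference goes away.
pub struct Imported {
    inner: FfiArray,
}

// Assumption of this sketch: the foreign memory is immutable and the
// release callback is safe to call from any thread.
unsafe impl Send for Imported {}
unsafe impl Sync for Imported {}

impl Drop for Imported {
    fn drop(&mut self) {
        if let Some(release) = self.inner.release {
            unsafe { release(&mut self.inner) };
        }
    }
}

impl Imported {
    /// Read-only view over the producer's memory.
    pub fn as_slice(&self) -> &[u8] {
        unsafe { std::slice::from_raw_parts(self.inner.buffer, self.inner.length as usize) }
    }
}

/// Wrap the incoming struct behind an `Arc` so it can be shared and
/// reference-counted across threads.
pub fn import(array: FfiArray) -> Arc<Imported> {
    Arc::new(Imported { inner: array })
}
```

With these definitions, `import(export(vec![1, 2, 3]))` yields an `Arc<Imported>` whose clones all read the same memory, and the producer's `Vec` is freed only when the last clone is dropped.
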
What is tested:

* round-trip `Python -> Rust -> Python` (new separate crate, `arrow-c-integration`)
* round-trip `Rust -> Rust -> Rust`

What is not tested yet:

* round-trip `Rust -> Python -> Rust`
* memory allocation (I am still trying to find a way of doing this in Rust)

Things to do:

* [ ] CI for the Python tests: it requires different compilation flags, which requires compiling the whole thing. I `excluded` it from the workspace as it does not behave well with rust-analyzer, but we need to add it to the CI nevertheless.
* [ ] Add more comments
* [ ] Add more tests
* [ ] Extend to all types that only use buffers
* [ ] Error on types that are not supported

Finally, this PR owes a lot to @pitrou, who spent _a lot_ of his time explaining to me how the C++ implementation does it and the main things that I had to worry about here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
