jorgecarleitao opened a new pull request #8401:
URL: https://github.com/apache/arrow/pull/8401


   This PR is a proposal to add support for the [C data 
interface](https://arrow.apache.org/docs/format/CDataInterface.html) by 
implementing the functionality needed to both consume and produce structs 
that follow its ABI and lifetime rules.
   
   For now this is limited to primitive types, but it generalizes easily to 
all types whose data is fully encapsulated in `ArrayData` (things with buffers 
and child data). For types where this is not the case (such as `StructArray`, 
`DictionaryArray` and `ListArray`, where some data also resides in their 
specific struct implementations instead of `ArrayData`), I believe that more 
work is needed. IMO we should strive to have all data in `ArrayData`, as that 
makes it significantly easier to export arrays via the C Data Interface and to 
understand what each array physically contains.
   
   Some design choices:
   
   * import and export do not care about the type of the data that is in 
memory (previously `BufferData`, now `Bytes`) - they only care about how it 
is converted between `ArrayData` and the C data interface.
   * import wraps incoming pointers in a struct behind an `Arc`, so that we 
thread-safely refcount them and can share them between buffers, arrays, etc.
   * `export` places `Buffer`s in `private_data` for bookkeeping and releases 
them when the consumer releases the array via `release`.
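   
   To make the last two points concrete, here is a minimal sketch of the 
idea. It is not the code in this PR: `FfiArrowArray`, `PrivateData`, 
`ImportedArray`, `ImportedBuffer`, `export_i32` and `import_i32` are 
hypothetical names, and a plain `Vec<i32>` stands in for `Buffer`. Only the 
field layout of the struct is taken from the C data interface specification.

```rust
use std::os::raw::c_void;
use std::ptr;
use std::sync::Arc;

/// Mirrors the `ArrowArray` struct of the C data interface (name is hypothetical).
#[repr(C)]
pub struct FfiArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut FfiArrowArray,
    pub dictionary: *mut FfiArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut FfiArrowArray)>,
    pub private_data: *mut c_void,
}

/// Producer-side bookkeeping kept behind `private_data` (hypothetical).
struct PrivateData {
    _values: Vec<i32>,                 // owns the memory the consumer reads
    buffer_ptrs: Box<[*const c_void]>, // owns the pointer array in `buffers`
}

/// Release callback: frees the bookkeeping and marks the struct as released.
unsafe extern "C" fn release_array(array: *mut FfiArrowArray) {
    if array.is_null() {
        return;
    }
    let array = unsafe { &mut *array };
    if array.release.is_none() {
        return; // already released
    }
    drop(unsafe { Box::from_raw(array.private_data as *mut PrivateData) });
    array.private_data = ptr::null_mut();
    array.buffers = ptr::null_mut();
    array.release = None;
}

/// Export: move the values into `private_data` and expose raw pointers.
pub fn export_i32(values: Vec<i32>) -> FfiArrowArray {
    let length = values.len() as i64;
    // Buffer 0 is the validity bitmap (null here: no nulls), buffer 1 the values.
    let buffer_ptrs: Box<[*const c_void]> =
        vec![ptr::null(), values.as_ptr() as *const c_void].into_boxed_slice();
    let mut private = Box::new(PrivateData { _values: values, buffer_ptrs });
    let buffers = private.buffer_ptrs.as_mut_ptr();
    FfiArrowArray {
        length,
        null_count: 0,
        offset: 0,
        n_buffers: 2,
        n_children: 0,
        buffers,
        children: ptr::null_mut(),
        dictionary: ptr::null_mut(),
        release: Some(release_array),
        private_data: Box::into_raw(private) as *mut c_void,
    }
}

/// Import: owns the incoming struct and calls `release` once, on drop.
pub struct ImportedArray {
    array: FfiArrowArray,
}

impl Drop for ImportedArray {
    fn drop(&mut self) {
        if let Some(release) = self.array.release {
            unsafe { release(&mut self.array) };
        }
    }
}

/// A view into the imported values; the `Arc` keeps the producer's memory alive.
#[derive(Clone)]
pub struct ImportedBuffer {
    _owner: Arc<ImportedArray>,
    ptr: *const i32,
    len: usize,
}

impl ImportedBuffer {
    pub fn as_slice(&self) -> &[i32] {
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}

/// Import a two-buffer `i32` array produced by `export_i32`.
pub fn import_i32(array: FfiArrowArray) -> ImportedBuffer {
    assert_eq!(array.n_buffers, 2);
    let len = array.length as usize;
    let ptr = unsafe { *array.buffers.add(1) } as *const i32;
    ImportedBuffer { _owner: Arc::new(ImportedArray { array }), ptr, len }
}
```

   In this sketch, calling `import_i32(export_i32(vec![1, 2, 3]))` and cloning 
the result gives two views over the same memory; the producer's `release` 
runs only when the last clone is dropped.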
   
   I do not expect this PR to be easy to review, as it touches sensitive 
(aka `unsafe`) code. However, based on the tests I have done so far, I am 
sufficiently happy to open it. I tried to comment as much as possible, and I 
will continue to do so in the following commits.
   
   This PR has three main parts:
   
   1. An `ffi` module that contains the import and export functionality
   2. Helpers to import and export an `Array` through the C Data Interface
   3. A crate to test this against Python/C++'s API
   
   It also does a small refactor of `BufferData`, renaming it to `Bytes` 
(motivated by the popular `bytes` crate), and moving it to a separate file.
   
   What is tested:
   
   * round-trip `Python -> Rust -> Python` (new separate crate, 
`arrow-c-integration`)
   * round-trip `Rust -> Rust -> Rust`
   
   What is not tested yet:
   
   * round-trip `Rust -> Python -> Rust`
   * memory allocation (I am still trying to find a way of doing this in Rust)
   
   Things to do:
   
   * [ ] CI for the Python tests: it requires different compilation flags, 
which requires compiling the whole thing. I `excluded` it from the workspace as 
it does not behave well with rust-analyzer, but we need to add it to the CI 
nevertheless.
   * [ ] Add more comments
   * [ ] Add more tests
   * [ ] Extend for all types that only use buffers
   * [ ] Error on types that are not supported
   
   Finally, this PR has a large contribution from @pitrou, who took _a lot_ of 
his time to explain to me how the C++ implementation does it and the main 
things that I had to worry about here.

