jorisvandenbossche commented on issue #38325: URL: https://github.com/apache/arrow/issues/38325#issuecomment-1981062328
I would like to revive this, now that there is some movement in exposing the Device Interface in pyarrow (low-level bindings have been added, https://github.com/apache/arrow/issues/39979) and in supporting it in cudf (https://github.com/rapidsai/cudf/pull/15047).

Concretely, I think adding a separate set of dunder methods, `__arrow_c_device_array__` and `__arrow_c_device_stream__` (option 2 from the top post), would be the best option moving forward.

It's not that these are entirely separate methods. For a producer/consumer that can handle the device interface, most of the code will be shared with the code required to handle the standard C interface, given that the C Device Interface is only a small extension on top of the C Interface struct. But keeping the dunder methods separate on the Python side allows libraries that only support the C Data Interface to still implement that part of the PyCapsule protocol. The _absence_ of the device versions of the dunder methods is then also an indication that this producer only supports CPU data.

The actual methods can mimic the current ones (see https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#arrowarray-export), just with adapted names:

* `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses `"arrow_device_array"` for the capsule name
* `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of `"arrow_device_array_stream"`

Both methods can then similarly accept a `requested_schema` keyword.
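To illustrate how a consumer could use the presence or absence of the device dunder methods, here is a minimal sketch. The classes `CpuOnlyProducer` and `DeviceAwareProducer` and the helper `export_array` are hypothetical names invented for this example, and plain strings stand in for the real PyCapsules, so this only shows the dispatch logic and not an actual export.

```python
class CpuOnlyProducer:
    """Hypothetical producer that only implements the existing CPU protocol."""

    def __arrow_c_array__(self, requested_schema=None):
        # Placeholder strings stand in for the real (schema, array) PyCapsules.
        return ("<schema capsule>", "<array capsule>")


class DeviceAwareProducer:
    """Hypothetical producer that implements the proposed device protocol."""

    def __arrow_c_device_array__(self, requested_schema=None):
        # Placeholder strings stand in for the real (schema, device array) PyCapsules.
        return ("<schema capsule>", "<device array capsule>")


def export_array(obj, requested_schema=None):
    """Prefer the device protocol; fall back to the CPU-only protocol.

    Returns a (protocol_name, capsules) tuple. A missing device dunder
    method is treated as "this producer only supports CPU data".
    """
    device_export = getattr(obj, "__arrow_c_device_array__", None)
    if device_export is not None:
        return "device", device_export(requested_schema=requested_schema)

    cpu_export = getattr(obj, "__arrow_c_array__", None)
    if cpu_export is not None:
        return "cpu", cpu_export(requested_schema=requested_schema)

    raise TypeError(
        f"{type(obj).__name__} implements neither __arrow_c_device_array__ "
        "nor __arrow_c_array__"
    )
```

A consumer that cannot handle device data would simply skip the first branch and only look for `__arrow_c_array__`.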
Some questions:

* My understanding is that in the Python API for the dunder methods, we don't need to expose anything to deal with the `sync_event`, as that is entirely handled by the consumer through the struct itself. This is in contrast with the DLPack [`__dlpack__` method](https://data-apis.org/array-api/2022.12/API_specification/generated/array_api.array.__dlpack__.html), which does have a `stream` keyword; given the way the event works, that is not needed here. Is this correct?
* Do we want to add an `__arrow_c_device__` method that returns the device of the array or stream? This would be similar to the [`__dlpack_device__` from DLPack](https://data-apis.org/array-api/2022.12/API_specification/generated/array_api.array.__dlpack_device__.html), and I suppose essentially just an easier way to quickly check the device of the object. For DLPack, one reason to have this is to check the device before you can pass the correct stream to `__dlpack__` (according to https://dmlc.github.io/dlpack/latest/python_spec.html#syntax-for-data-interchange-with-dlpack), and that is of course not a relevant reason for us (based on the previous bullet point).
* Similar to the PyCapsule protocol we already added, I would for now not specify anything about what the consumer side should look like (in contrast to DLPack, which has a [`from_dlpack`](https://data-apis.org/array-api/2022.12/API_specification/generated/array_api.from_dlpack.html) specified through the Array API). That also means it is up to the consumer library how to deal with device copies etc. (although I assume typically a copy will be avoided unless explicitly asked?)
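As a rough illustration of why no `stream` keyword is needed, the device information and the `sync_event` travel inside the exported struct itself. The `ctypes` sketch below mirrors the `ArrowArray` and `ArrowDeviceArray` layouts as I understand them from the C Device Data Interface specification; treat the field layout as an assumption to check against `arrow/c/abi.h`. The helper `describe_device` is a hypothetical name, and the struct instance is built by hand rather than taken from a real capsule.

```python
import ctypes

# Device type constants as defined in the C Device Data Interface.
ARROW_DEVICE_CPU = 1
ARROW_DEVICE_CUDA = 2


class ArrowArray(ctypes.Structure):
    pass


# Fields are assigned after the class statement so the struct can refer to itself.
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),  # a function pointer in C; opaque here
    ("private_data", ctypes.c_void_p),
]


class ArrowDeviceArray(ctypes.Structure):
    _fields_ = [
        ("array", ArrowArray),
        ("device_id", ctypes.c_int64),
        ("device_type", ctypes.c_int32),
        ("sync_event", ctypes.c_void_p),  # NULL means no synchronization needed
        ("reserved", ctypes.c_int64 * 3),
    ]


def describe_device(dev_array):
    """Read the device information a consumer would find in the struct."""
    return {
        "device_type": dev_array.device_type,
        "device_id": dev_array.device_id,
        # A NULL pointer reads back as None, so bool() gives False.
        "needs_sync": bool(dev_array.sync_event),
    }


# Hand-built example value; a real consumer would get this from the capsule.
example = ArrowDeviceArray(device_id=0, device_type=ARROW_DEVICE_CUDA)
info = describe_device(example)
```

Because the consumer can read `device_type`, `device_id` and `sync_event` after the export, a separate `__arrow_c_device__` method would mainly be a convenience for checking the device before exporting.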
