Re: [I] [Python] Expose the device interface through the Arrow PyCapsule protocol [arrow]

via GitHub Wed, 06 Mar 2024 07:00:44 -0800


jorisvandenbossche commented on issue #38325:
URL: https://github.com/apache/arrow/issues/38325#issuecomment-1981062328


   I would like to revive this, now that there is some movement in exposing the 
Device Interface in pyarrow (low-level bindings have been added, 
https://github.com/apache/arrow/issues/39979) and in supporting it in cudf 
(https://github.com/rapidsai/cudf/pull/15047).
   
   Concretely, I think adding a separate set of dunder methods, 
`__arrow_c_device_array__` and `__arrow_c_device_stream__` (option 2 from the 
top post), would be the best option moving forward. 
   It's not that these are entirely separate methods. For a producer/consumer that 
can handle the device interface, most of the code will be shared with the code 
required to handle the standard C interface, given that the C Device Interface 
is only a small extension on top of the C Interface struct. But keeping the 
dunder methods separate on the Python side allows libraries that only support 
the C Data Interface to still implement that part of the PyCapsule protocol. 
The _absence_ of the device versions of the dunder methods is then also an 
indication that this producer only supports CPU data.
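
   To illustrate the point about absence as a signal, here is a minimal sketch of how a consumer could pick an export path. The helper `pick_export_path` and the two toy producer classes are hypothetical names used only for this example; nothing here is a proposed API, and the producer methods are placeholders that do not build real capsules.

```python
def pick_export_path(obj):
    """Return a label for the export a consumer could try first.

    Hypothetical helper for illustration only, not part of pyarrow.
    """
    # Device-aware producers expose the device variants of the dunders.
    if hasattr(obj, "__arrow_c_device_array__"):
        return "device_array"
    if hasattr(obj, "__arrow_c_device_stream__"):
        return "device_stream"
    # No device dunders: treat the producer as CPU-only.
    if hasattr(obj, "__arrow_c_array__"):
        return "cpu_array"
    if hasattr(obj, "__arrow_c_stream__"):
        return "cpu_stream"
    raise TypeError("object does not implement the Arrow PyCapsule protocol")


class CpuOnlyProducer:
    """Toy producer that only supports the standard C Data Interface."""

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("placeholder, no real capsules here")


class DeviceAwareProducer(CpuOnlyProducer):
    """Toy producer that additionally supports the device interface."""

    def __arrow_c_device_array__(self, requested_schema=None):
        raise NotImplementedError("placeholder, no real capsules here")
```

   With these toy classes, `pick_export_path(CpuOnlyProducer())` gives `"cpu_array"` and `pick_export_path(DeviceAwareProducer())` gives `"device_array"`.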
   
   The actual methods can mimic the current ones (see 
https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#arrowarray-export),
 just with adapted names:
   
   * `__arrow_c_device_array__` returns a pair of PyCapsules containing a C 
ArrowSchema and ArrowDeviceArray, where the latter uses `"arrow_device_array"` 
for the capsule name
   * `__arrow_c_device_stream__` returns a PyCapsule containing a C 
ArrowDeviceArrayStream, where the capsule uses `"arrow_device_array_stream"` 
for the capsule name
   
   Both methods can then similarly accept a `requested_schema` keyword.
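
   As a rough sketch of what those signatures might look like on the Python side, the proposed methods can be written down as `typing.Protocol` classes. The protocol class names and the two constants are made up for this example; the capsule names are the ones proposed above, and the default of `None` for `requested_schema` is an assumption that mirrors the existing `__arrow_c_array__` method.

```python
from typing import Any, Optional, Protocol, Tuple, runtime_checkable

# Capsule names as proposed in this comment (constant names are hypothetical).
DEVICE_ARRAY_CAPSULE_NAME = "arrow_device_array"
DEVICE_STREAM_CAPSULE_NAME = "arrow_device_array_stream"


@runtime_checkable
class SupportsArrowCDeviceArray(Protocol):
    def __arrow_c_device_array__(
        self, requested_schema: Optional[Any] = None
    ) -> Tuple[Any, Any]:
        """Return a (schema capsule, device array capsule) pair."""
        ...


@runtime_checkable
class SupportsArrowCDeviceStream(Protocol):
    def __arrow_c_device_stream__(
        self, requested_schema: Optional[Any] = None
    ) -> Any:
        """Return a single device array stream capsule."""
        ...


class ToyDeviceArray:
    """Placeholder producer; a real one would build actual PyCapsules."""

    def __arrow_c_device_array__(self, requested_schema=None):
        raise NotImplementedError("placeholder, no real capsules here")
```

   A runtime `isinstance` check against these protocols only tests that the method exists, so `isinstance(ToyDeviceArray(), SupportsArrowCDeviceArray)` is true while the stream protocol check is false.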
   
   Some questions:
   
   
   * My understanding is that in the Python API for the dunder methods, we 
don't need to expose anything to deal with the `sync_event`, as that is 
entirely handled by the consumer through the struct itself (in contrast with 
the DLPack [`__dlpack__` 
method](https://data-apis.org/array-api/2022.12/API_specification/generated/array_api.array.__dlpack__.html),
 which does have a `stream` keyword; given the way the event works, that is 
not needed here). Is this correct?
   * Do we want to add an `__arrow_c_device__` method that returns the device 
of the array or stream? 
     This would be similar to the [`__dlpack_device__` from 
DLPack](https://data-apis.org/array-api/2022.12/API_specification/generated/array_api.array.__dlpack_device__.html),
 and I suppose essentially just an easier way to quickly check the device of 
the object. 
     For DLPack, one reason to have this is to check the device before you can 
pass the correct stream to `__dlpack__` (according to 
https://dmlc.github.io/dlpack/latest/python_spec.html#syntax-for-data-interchange-with-dlpack),
 and that's of course not a relevant reason for us (based on the previous 
bullet point)
   * As with the PyCapsule protocol we already added, I would for now 
not specify anything about what the consumer side should look like (in contrast 
to DLPack which has a 
[`from_dlpack`](https://data-apis.org/array-api/2022.12/API_specification/generated/array_api.from_dlpack.html)
 specified through the Array API). That also means it is up to the consumer 
library how to deal with device copies etc (although I assume typically a copy 
will be avoided unless explicitly asked?)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
