kkraus14 commented on code in PR #34972:
URL: https://github.com/apache/arrow/pull/34972#discussion_r1164340819
##########
cpp/src/arrow/c/abi.h:
##########
@@ -106,6 +212,98 @@ struct ArrowArrayStream {
#endif // ARROW_C_STREAM_INTERFACE
+#ifndef ARROW_C_DEVICE_STREAM_INTERFACE
+#define ARROW_C_DEVICE_STREAM_INTERFACE
+
+/// \brief Equivalent to ArrowArrayStream, but for ArrowDeviceArrays.
+///
+/// This stream is intended to provide a stream of data on a single
+/// device; if a producer wants data to be produced on multiple devices,
+/// then multiple streams should be provided, one per device.
+struct ArrowDeviceArrayStream {
+ /// \brief The device that this stream produces data on.
+ ///
+ /// All ArrowDeviceArrays that are produced by this
+ /// stream should have the same device_type as set
+ /// here. The device_type needs to be provided here
+ /// so that consumers can provide the correct type
+ /// of queue_ptr when calling get_next.
+ ArrowDeviceType device_type;
+
+ /// \brief Callback to get the stream schema
+ /// (will be the same for all arrays in the stream).
+ ///
+ /// If successful, the ArrowSchema must be released independently from
+ /// the stream. The schema should be accessible via CPU memory.
+ ///
+ /// \param[in] self The ArrowDeviceArrayStream object itself
+ /// \param[out] out C struct to export the schema to
+ /// \return 0 if successful, an `errno`-compatible error code otherwise.
+ int (*get_schema)(struct ArrowDeviceArrayStream* self, struct ArrowSchema* out);
+
+ /// \brief Callback to get the device id for the next array.
+ ///
+ /// This is necessary so that the correct stream pointer can be provided
+ /// to get_next.
+ ///
+ /// The next call to `get_next` should provide an ArrowDeviceArray whose
+ /// device_id matches what is provided here, and whose device_type is the
+ /// same as the device_type member of this stream.
+ ///
+ /// \param[in] self The ArrowDeviceArrayStream object itself
+ /// \param[out] out_device_id Pointer to be populated with the device id;
+ /// must not be null
+ /// \return 0 if successful, an `errno`-compatible error code otherwise.
+ int (*get_next_device_id)(struct ArrowDeviceArrayStream* self, int64_t* out_device_id);
+
+ /// \brief Callback to get the next array
+ ///
+ /// If there is no error and the returned array has been released, the stream
+ /// has ended. If successful, the ArrowArray must be released independently
+ /// from the stream.
+ ///
+ /// Because different frameworks use different types to represent this, we
+ /// accept a void* which should then be reinterpreted into whatever the
+ /// appropriate type is (e.g. cudaStream_t) for use by the producer.
+ ///
+ /// \param[in] self The ArrowDeviceArrayStream object itself
+ /// \param[in] queue_ptr The appropriate queue, stream, or
+ /// equivalent object for the device that the data is allocated on
+ /// to indicate where the consumer wants the data to be accessible.
+ /// If queue_ptr is NULL then the default stream (e.g. CUDA stream 0)
+ /// should be used to ensure that the memory is accessible from any stream.
Review Comment:
> Why wouldn't they? They can easily refcount the usage of their own CUDA
streams.
I think that is making a lot of assumptions about how folks use and manage
CUDA streams 😄. Again, some places use them similarly to thread pools and only
control the lifetime of the pool.
I tried to dig through TensorFlow's code to figure out exactly how they're
managing the lifetime of their streams, but I'm not confident; everything I say
below may not be correct:
- Something eventually calls down to `AllocateStream` and `DeallocateStream`
(https://github.com/tensorflow/tensorflow/blob/b9fc6a9b611ec373c02e5b5ab432b1d7aff9392e/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc#L759-L774)
to create and destroy CUDA streams.
- These operate on raw ptrs and it looks like there's a class that wraps
these, `Stream`
(https://github.com/tensorflow/tensorflow/blob/b9fc6a9b611ec373c02e5b5ab432b1d7aff9392e/tensorflow/compiler/xla/stream_executor/stream.cc#L262-L286)
which has constructor, destructor, and init functions roughly as you'd expect.
- I believe these `Stream` objects are managed in a `StreamPool`
(https://github.com/tensorflow/tensorflow/blob/b9fc6a9b611ec373c02e5b5ab432b1d7aff9392e/tensorflow/compiler/xla/service/stream_pool.h#L27-L59)
which then allows "borrowing" the streams using unique ptrs.
I guess, in theory, since they ultimately have `Stream` objects in use, one
could be moved into the private data used by the release callback.