westonpace commented on PR #43632: URL: https://github.com/apache/arrow/pull/43632#issuecomment-2282026639
Ah, one more potential approach. This one is actually my favorite for a file(s) scanner. It has all the advantages of a pull-based approach, plus it puts the consumer in complete control of how much decode parallelism there is (by calling `get_next_task` again before the tasks returned by previous calls have completed). On the other hand, it can be more difficult to implement if the producer has no control over the parallelism or does not know the size of the stream ahead of time (e.g. maybe it is receiving data from a TCP stream):

```
struct Waker {
  // Signal to the consumer that more data is available. The consumer shall release any
  // resources associated with the waker after this method is called. The producer shall
  // not call any other methods after calling this method.
  //
  // The producer must always call this method even if an error is encountered (in which
  // case the error will be reported on the following call to `get_next`).
  void (*wake)(struct Waker* waker);
  void* private_data;
};

struct ArrowArrayTask {
  // Callback to get the array
  // (if no error and the array is released, the stream has ended)
  //
  // Return value: 0 if successful, an `errno`-compatible error code otherwise.
  //
  // If EWOULDBLOCK is returned then the array is not ready. If the producer returns this
  // value then the producer takes ownership of `waker` and is responsible for calling
  // `wake` when more data is available. If any other value is returned then the producer
  // shall ignore `waker` and never call any methods on it.
  //
  // If successful, the ArrowDeviceArray must be released independently from the stream.
  int (*get_next)(struct ArrowArrayTask* self, struct Waker* waker,
                  struct ArrowDeviceArray* out);
  void (*release)(struct ArrowArrayTask* self);
  void* private_data;
};

struct ArrowAsyncDeviceArrayStream {
  // Callback to get the next array task
  // (if no error and the task is released, the stream has ended)
  //
  // Return value: 0 if successful, an `errno`-compatible error code otherwise.
  //
  // The consumer is allowed to call this method again even if the tasks returned by a
  // previous call have not completed. However, the consumer shall not call this
  // re-entrantly.
  //
  // If successful, the ArrowArrayTask must be released independently from the stream.
  int (*get_next_task)(struct ArrowAsyncDeviceArrayStream* self,
                       struct ArrowArrayTask* task);

  // The rest is identical to `ArrowDeviceArrayStream`
  ArrowDeviceType device_type;
  const char* (*get_last_error)(struct ArrowAsyncDeviceArrayStream* self);
  void (*release)(struct ArrowAsyncDeviceArrayStream* self);
  void* private_data;
};
```
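To make the `EWOULDBLOCK` / `wake` handshake concrete, here is a rough consumer-side sketch. It is only an illustration of the proposal, not an existing Arrow API: `ArrowDeviceArray` is replaced by a one-field stand-in, and `ToyState`, `toy_get_next`, `toy_release`, `toy_data_arrived`, `flag_wake` and `run_demo` are all hypothetical names. The toy producer reports "not ready" on the first poll, keeps the waker, and is then nudged by hand to simulate data arriving.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the real ArrowDeviceArray (assumption: one int payload). */
struct ArrowDeviceArray { int value; };

struct Waker {
  void (*wake)(struct Waker* waker);
  void* private_data;
};

struct ArrowArrayTask {
  int (*get_next)(struct ArrowArrayTask* self, struct Waker* waker,
                  struct ArrowDeviceArray* out);
  void (*release)(struct ArrowArrayTask* self);
  void* private_data;
};

/* Hypothetical toy producer state: poll count plus the waker it owns. */
struct ToyState {
  int polls;
  struct Waker* pending;
};

static int toy_get_next(struct ArrowArrayTask* self, struct Waker* waker,
                        struct ArrowDeviceArray* out) {
  struct ToyState* st = (struct ToyState*)self->private_data;
  if (st->polls++ == 0) {
    st->pending = waker; /* not ready: producer takes ownership of the waker */
    return EWOULDBLOCK;
  }
  out->value = 42; /* ready: the waker is ignored on this path */
  return 0;
}

static void toy_release(struct ArrowArrayTask* self) { self->release = NULL; }

/* Simulates the producer noticing that data has arrived. */
static void toy_data_arrived(struct ToyState* st) {
  struct Waker* w = st->pending;
  st->pending = NULL;
  if (w != NULL) w->wake(w); /* last call the producer makes on this waker */
}

/* Consumer-side waker: sets a flag the polling loop can inspect. */
static void flag_wake(struct Waker* waker) { *(int*)waker->private_data = 1; }

/* Polls one task to completion; returns the payload, or -errno on error. */
int run_demo(void) {
  struct ToyState st = {0, NULL};
  struct ArrowArrayTask task = {toy_get_next, toy_release, &st};
  struct ArrowDeviceArray out = {0};
  int rc;
  do {
    int woken = 0;
    struct Waker waker = {flag_wake, &woken}; /* fresh waker for each poll */
    rc = task.get_next(&task, &waker, &out);
    if (rc == EWOULDBLOCK) {
      toy_data_arrived(&st); /* a real producer would do this on its own thread */
      assert(woken);         /* the producer must always call wake */
    }
  } while (rc == EWOULDBLOCK);
  task.release(&task);
  return rc == 0 ? out.value : -rc;
}
```

In a real consumer the waker would typically reschedule the task on an executor rather than set a flag, and decode parallelism would come from holding several tasks obtained from `get_next_task` and polling them concurrently.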