westonpace commented on PR #43632:
URL: https://github.com/apache/arrow/pull/43632#issuecomment-2282026639

   Ah, one more potential approach.  This one is actually my favorite for a 
file(s) scanner.  It has all the advantages of a pull-based approach plus it 
puts the consumer in complete control of deciding how much decode parallelism 
there should be (by calling `next_task` before the previous call to `next_task` 
has completed).  On the other hand, it can make it more difficult if the 
producer does not have any control over the parallelism or know the size of the 
stream ahead-of-time (e.g. maybe it is receiving data from a TCP stream):
   
   ```
   struct Waker {
     // Signal to the producer that more data is available, the consumer shall 
release any resources
     // associated with the waker after this method is called.  The producer 
shall not call any other
     // methods after calling this method.
     //
     // The producer must always call this method even if an error is 
encountered (in which case the
     // error will be reported on the following call to `get_next`).
     void wake(Waker* waker); 
     void* private_data;
   };
   
   struct ArrowArrayTask {
     // Callback to get the array
     // (if no error and the array is released, the stream has ended)
     //
     // Return value: 0 if successful, an `errno`-compatible error code 
otherwise.
     //
     // If EWOULDBLOCK is returned then the array is not ready.  If the
     // producer returns this value then the producer takes ownership of 
`waker` and is responsible
     // for calling `wake` when more data is available.  If any other value is 
returned then the producer
     // shall ignore `waker` and never call any methods on it.
     //
     // If successful, the ArrowDeviceArray must be released independently from 
the stream.
     int (*get_next)(struct ArrowArrayTask* self, Waker* waker, struct 
ArrowDeviceArray* out);
     void (*release)(struct ArrowArrayTask* self);
     void* ArrowArrayTask;
   };
   
   struct ArrowAsyncDeviceArrayStream {
     // Callback to get the next array task
     // (if no error and the task is released, the stream has ended)
     //
     // Return value: 0 if successful, an `errno`-compatible error code 
otherwise.
     //
     // The consumer is allowed to call this method again even if the tasks 
returned by a previous
     // call have not completed.  However, the consumer shall not call this 
re-entrantly.
     //
     // If successful, the ArrowDeviceArray must be released independently from 
the stream.
     int (*get_next_task)(struct ArrowAsyncDeviceArrayStream* self, struct 
ArrowArrayTask* task);
   
     // The rest is identical to `ArrowDeviceArrayStream`
     ArrowDeviceType device_type;
     const char* (*get_last_error)(struct ArrowAsyncDeviceArrayStream* self);
     void (*release)(struct ArrowAsyncDeviceArrayStream* self);
     void* private_data;
   };
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to