leventov commented on PR #43632:
URL: https://github.com/apache/arrow/pull/43632#issuecomment-2311887589

   I'd suggest that to help resolve the callbacks vs. tasks dilemma, someone 
could take several representative examples of processing engines that are going 
to be users of this API, such as DataFusion, Velox, ClickHouse, Doris, or the 
like, and see whether callbacks or tasks fit better into their threading/task 
execution model.
   
   ### ClickHouse
   
   For example, here is ClickHouse's model: 
https://github.com/ClickHouse/ClickHouse/blob/6289c65e0286127689303b5b7a543212ca38e0c7/src/Processors/IProcessor.h
   
   -- Clearly, it would work better with a task-based Arrow Async handler (it 
would be a "source" IProcessor, as they call IProcessors with no input ports 
and a single "output port").
   
   > My big concern is the case of an IPC stream where we don't know beforehand 
how many record batches are in the stream. If get_next_task is called multiple 
times in a row, historically we've used a return of 0 + null for the Array to 
indicate the end of the stream. So the only reasonable way to handle 
get_next_task being called multiple times in a row while there are tasks 
pending is to return a valid task for each call until the producer knows that 
the stream has ended and then we define semantics for the task struct to 
indicate that the stream has ended successfully.
   
   @zeroshade FWIW, please compare with how IProcessor handles this: once a 
processor has nothing more to produce, `IProcessor::prepare()` returns 
`Status::Finished`.
   
   IProcessor also has an explicit `cancel()` method.
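   
   To make the comparison concrete, here is a minimal sketch of that style of 
source. It is not ClickHouse code: `Status`, `OutputPort`, `StreamSource`, and 
`drain` are hypothetical, heavily simplified names, and plain integers stand in 
for record batches. It only illustrates how a cheap `prepare()` state check, a 
separate `work()` task, a `Finished` status, and `cancel()` can cover a stream 
whose length the consumer does not know in advance.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical, reduced set of states; the real IProcessor has more.
enum class Status { Ready, PortFull, Finished };

// Hypothetical single-slot output port.
struct OutputPort {
    std::optional<int> slot;
    bool canPush() const { return !slot.has_value(); }
    void push(int v) { slot = v; }
    int pull() { int v = *slot; slot.reset(); return v; }
};

// A "source": no inputs, one output. The consumer never learns the batch
// count up front; it only observes the status returned by prepare().
class StreamSource {
public:
    explicit StreamSource(std::vector<int> batches) : batches_(std::move(batches)) {}

    // Cheap, non-blocking state check that tells the scheduler what to do next.
    Status prepare() {
        if (cancelled_) return Status::Finished;
        if (!output.canPush()) return Status::PortFull;
        if (pending_) {  // hand over the batch produced by the last work()
            output.push(*pending_);
            pending_.reset();
            return Status::PortFull;
        }
        if (next_ == batches_.size()) return Status::Finished;  // end of stream
        return Status::Ready;
    }

    // The actual task; run only after prepare() returned Ready.
    void work() {
        assert(next_ < batches_.size());
        pending_ = batches_[next_++];
    }

    // Explicit cancellation: every later prepare() reports Finished.
    void cancel() { cancelled_ = true; }

    OutputPort output;

private:
    std::vector<int> batches_;
    std::size_t next_ = 0;
    std::optional<int> pending_;
    bool cancelled_ = false;
};

// Toy single-threaded scheduler: drives the source until it reports Finished,
// cancelling once `cancel_after` batches have been consumed.
std::vector<int> drain(StreamSource& src,
                       std::size_t cancel_after = std::numeric_limits<std::size_t>::max()) {
    std::vector<int> out;
    for (;;) {
        switch (src.prepare()) {
            case Status::Ready:
                src.work();
                break;
            case Status::PortFull:
                out.push_back(src.output.pull());
                if (out.size() == cancel_after) src.cancel();
                break;
            case Status::Finished:
                return out;
        }
    }
}
```

   In this sketch, end of stream is a status of the processor rather than a 
special value of the data, so the scheduler never has to interpret an empty or 
null batch.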
   
   ### Other systems
   
   Could someone please help with research for other systems?
   
   I don't suggest copying any specific system's approach verbatim; the best 
choice is probably the approach that is compatible with the largest number of 
these systems.

