leventov commented on PR #43632: URL: https://github.com/apache/arrow/pull/43632#issuecomment-2311887589
I'd suggest that, to help resolve the callbacks vs. tasks dilemma, someone take several representative processing engines that are going to be users of this API, such as DataFusion, Velox, ClickHouse, Doris, or the like, and check whether callbacks or tasks fit better into their threading/task execution models.

### ClickHouse

For example, here is ClickHouse's model: https://github.com/ClickHouse/ClickHouse/blob/6289c65e0286127689303b5b7a543212ca38e0c7/src/Processors/IProcessor.h

Clearly, it would work better with a task-based Arrow Async handler (it would be a "source" IProcessor, as they call IProcessors with no input ports and a single "output port").

> My big concern is the case of an IPC stream where we don't know beforehand how many record batches are in the stream. If get_next_task is called multiple times in a row, historically we've used a return of 0 + null for the Array to indicate the end of the stream. So the only reasonable way to handle get_next_task being called multiple times in a row while there are tasks pending is to return a valid task for each call until the producer knows that the stream has ended and then we define semantics for the task struct to indicate that the stream has ended successfully.

@zeroshade FWIW, please compare with how IProcessor handles this: at the end of the stream, `IProcessor.prepare()` should return `Status.Finished`. IProcessor also has an explicit `.cancel()` callback.

### Other systems

Could someone please help with research on other systems? I don't suggest copying any specific system's approach verbatim, but taking the approach that is compatible with the most systems is probably best.
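
To make the IProcessor comparison concrete, here is a minimal C++ sketch of a pull-based source that reports end of stream as an explicit status instead of a sentinel value. All names here (`StreamSource`, `Batch`, `Status`, `drain`) are hypothetical and only loosely modeled on the `prepare()` / `.cancel()` pattern mentioned above; this is neither ClickHouse's actual `IProcessor` interface nor anything proposed for the Arrow API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical status values; not ClickHouse's actual IProcessor status enum.
enum class Status { Ready, Finished };

// Hypothetical stand-in for a record batch.
struct Batch {
    int id;
};

// A pull-based source for a stream whose consumer does not know the
// number of batches in advance.
class StreamSource {
public:
    explicit StreamSource(std::vector<Batch> batches)
        : pending_(std::move(batches)) {}

    // Reports whether another batch can be produced. End of stream (or
    // cancellation) is an explicit state, not a null/zero sentinel.
    Status prepare() const {
        if (cancelled_ || next_ >= pending_.size()) {
            return Status::Finished;
        }
        return Status::Ready;
    }

    // Produces the next batch. Only valid after prepare() returned Ready.
    Batch work() { return pending_[next_++]; }

    // Explicit cancellation: every later prepare() call reports Finished.
    void cancel() { cancelled_ = true; }

private:
    std::vector<Batch> pending_;
    std::size_t next_ = 0;
    bool cancelled_ = false;
};

// Consumer loop: keep pulling while the source is Ready, and collect the
// ids of the batches that were produced.
std::vector<int> drain(StreamSource& source) {
    std::vector<int> ids;
    while (source.prepare() == Status::Ready) {
        ids.push_back(source.work().id);
    }
    return ids;
}
```

In this sketch the consumer never has to interpret a special "empty" batch: it polls `prepare()` and stops once it sees `Status::Finished`, whether the stream ended on its own or `cancel()` was called.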
