agavra commented on issue #9615: URL: https://github.com/apache/pinot/issues/9615#issuecomment-1281783137
I think we should consider our thread utilization end-to-end. Specifically, in the multistage engine, if we consider all the operators between a `MailboxReceiveOperator` and a `MailboxSendOperator` as a single "stage", we need to make sure that the stage only runs when there's work to be done, and that the threads doing that work are shared efficiently between potentially many concurrent queries. While investigating #9563, I discovered that in order to really get non-blocking behavior, we need the entire stage to be non-blocking (e.g. the entire stage should be deferred while there's no work to do, so that the thread pool running it can be freed up to schedule another task).

Ideally, I'd suggest something like:

This gives us a few improvements over the existing architecture:

1. The obvious advantage is that it's reactive and will only schedule stage execution when there's work to be done (we don't block threads/worker pools unless work is being done).
2. It allows us to control memory and thread utilization across multiple queries that are running on a single node, while [applying backpressure](https://github.com/grpc/grpc-java/blob/master/examples/src/main/java/io/grpc/examples/manualflowcontrol/ManualFlowControlServer.java) on servers if we aren't able to keep up (e.g. we're doing some complex aggregation on the intermediate stage nodes).
3. I'm not totally sure this is possible with gRPC the way it's written today, but if we could figure out a way to reuse byte buffers from incoming `MailboxContent`, we could likely be more efficient in terms of GC.
4. Debug-ability and observability are pretty awesome if this is done correctly, because you can easily introspect each part of the pipeline (data was received, a task was scheduled, a task completed the data that was passed to it, a task is idle).

We could certainly do this in stages. Commenting on your specific considerations:

> consolidate multiple ArrayBlockingQueue in each observer into just sharing a single queue thus the mailbox receive operator waits for signals from the blocking queue offer

I think this is exactly right. I'd suggest we even consolidate it across queries (and then use a simple in-memory scheduler to forward the right data to the right tasks), but that's probably a bigger change.

> create a signaling mechanism to info mailbox receive operator thread to start reading directly from the notifying observer

Also +1, but the subtle difference with what I suggest here is that it doesn't signal the mailbox receive operator specifically, but rather the executor that schedules the receive operator in the first place. Basically, there would be no mailbox receive operator thread allocated unless there's work to be done.

Happy to start prototyping some of this stuff up :)
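To make the "no thread allocated unless there's work" idea concrete, here's a minimal JDK-only sketch: a stage holds no thread while idle, and a mailbox delivery submits it to a shared executor only when it isn't already scheduled. All names here (`ReactiveStage`, `onData`, etc.) are illustrative assumptions, not existing Pinot or gRPC classes:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stage that consumes no thread while idle: data arrival
// schedules it on a pool shared across all queries on the node.
final class ReactiveStage {
  private final Queue<String> inbox = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean scheduled = new AtomicBoolean(false);
  private final ExecutorService sharedPool;
  final AtomicInteger processed = new AtomicInteger();

  ReactiveStage(ExecutorService sharedPool) { this.sharedPool = sharedPool; }

  // Called by the mailbox (e.g. a gRPC observer) when content arrives.
  void onData(String block) {
    inbox.offer(block);
    trySchedule();
  }

  private void trySchedule() {
    // At most one task per stage is in flight; idle stages hold no thread.
    if (scheduled.compareAndSet(false, true)) {
      sharedPool.submit(this::drain);
    }
  }

  private void drain() {
    String block;
    while ((block = inbox.poll()) != null) {
      processed.incrementAndGet(); // stand-in for running the operator chain
    }
    scheduled.set(false);
    // Re-check: data may have raced in between the last poll and the unset.
    if (!inbox.isEmpty()) {
      trySchedule();
    }
  }
}

public class ReactiveStageDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    ReactiveStage stage = new ReactiveStage(pool);
    for (int i = 0; i < 100; i++) stage.onData("block-" + i);
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println(stage.processed.get()); // all 100 blocks processed
  }
}
```

The `scheduled` flag plus the re-check after unsetting it is what keeps this lossless under races, without ever parking a thread on an empty mailbox.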
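On the backpressure point: the linked `ManualFlowControlServer` example works on credits — the receiver requests one more message only after it finishes the previous one, so a slow stage transparently slows the sender down. Here's a single-threaded, JDK-only simulation of that credit scheme (no actual gRPC; the queue stands in for the wire):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simulated credit-based flow control: the sender may only emit while it
// holds credits, and the receiver grants one credit per block it finishes
// (the analogue of ServerCallStreamObserver.request(1) in grpc-java).
public class BackpressureDemo {
  public static void main(String[] args) {
    Queue<Integer> network = new ArrayDeque<>(); // blocks "in flight"
    List<Integer> received = new ArrayList<>();
    int credits = 1;          // receiver starts by requesting one block
    int next = 0, total = 5, maxInFlight = 0;

    while (received.size() < total) {
      // Sender: emit only while the receiver has granted credits.
      while (credits > 0 && next < total) {
        network.offer(next++);
        credits--;
        maxInFlight = Math.max(maxInFlight, network.size());
      }
      // Receiver: process one block, then grant one more credit.
      received.add(network.poll());
      credits++;
    }
    // With request(1)-style credits, at most one block is ever buffered.
    System.out.println(received.size() + " " + maxInFlight); // 5 1
  }
}
```

The point of the simulation is the invariant: buffered memory per mailbox is bounded by the credits the receiving stage chooses to grant, which is exactly the knob we'd want for controlling memory across concurrent queries.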
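And for consolidating the per-observer `ArrayBlockingQueue`s: a rough sketch of what offering tagged envelopes into one shared queue (even across queries) and routing them with a small in-memory dispatcher could look like. `Mail`, the stage ids, and the method names are all hypothetical:

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: every observer, from every query, offers into one
// shared queue; a single dispatcher forwards each block to the right stage.
public class SharedMailboxDemo {
  // Tagged envelope: stageId identifies the destination stage/query.
  record Mail(String stageId, String block) {}

  private static final Mail POISON = new Mail("", "");
  private final BlockingQueue<Mail> shared = new LinkedBlockingQueue<>();
  final Map<String, Queue<String>> perStage = new ConcurrentHashMap<>();

  // Called by any gRPC observer; no per-observer queue needed.
  void offer(String stageId, String block) {
    shared.offer(new Mail(stageId, block));
  }

  // In a real implementation, routing a block is also where the target
  // stage's task would get scheduled on the shared executor.
  void dispatchUntilDone() throws InterruptedException {
    Mail mail;
    while ((mail = shared.take()) != POISON) {
      perStage.computeIfAbsent(mail.stageId(), k -> new ConcurrentLinkedQueue<>())
              .offer(mail.block());
    }
  }

  public static void main(String[] args) throws Exception {
    SharedMailboxDemo demo = new SharedMailboxDemo();
    for (int i = 0; i < 10; i++) {
      demo.offer("query1-stage2", "b" + i);  // two queries interleaved
      demo.offer("query2-stage1", "b" + i);
    }
    demo.shared.offer(POISON);
    demo.dispatchUntilDone();
    System.out.println(demo.perStage.get("query1-stage2").size() + " "
        + demo.perStage.get("query2-stage1").size()); // 10 10
  }
}
```

Because the dispatcher is the single place where data and scheduling meet, it's also the natural spot to hang the observability hooks mentioned above (received, scheduled, completed, idle).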
