agavra commented on issue #9615:
URL: https://github.com/apache/pinot/issues/9615#issuecomment-1281783137

   I think we should consider our thread utilization end-to-end. Specifically, in our multistage engine, if we consider all the operators between a `MailboxReceiveOperator` and a `MailboxSendOperator` as a single "stage", we need to make sure that the stage only runs when there's work to be done and that the threads doing that work are shared efficiently across potentially many concurrent queries.
   
   While investigating #9563, I discovered that in order to really get non-blocking behavior we need the entire stage to be non-blocking (i.e. the entire stage should be deferred while there's no work to do, so that the thread pool running it is freed up to schedule another task).
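
   To make this concrete, here's a minimal sketch of what "deferring a whole stage" could look like. Every name here is hypothetical (none of this is existing Pinot code); the only point it illustrates is that the shared worker pool never holds an idle thread for a stage that has nothing to read.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: "stage" stands for all the operators between the receive
// and send mailboxes of one query; this is not an existing Pinot class.
final class StageRunner {
  private final ExecutorService workers;   // pool shared by every query on this node
  private final Runnable stage;            // drains whatever input is currently buffered
  private final AtomicBoolean pending = new AtomicBoolean(false);
  private final AtomicBoolean scheduled = new AtomicBoolean(false);

  StageRunner(ExecutorService workers, Runnable stage) {
    this.workers = workers;
    this.stage = stage;
  }

  /** Called by the mailbox layer whenever new data lands for this stage. */
  void onDataAvailable() {
    pending.set(true);
    trySchedule();
  }

  private void trySchedule() {
    // At most one task per stage is in flight; re-checking `pending` after the task
    // finishes avoids losing a wake-up that arrived while the stage was running.
    if (pending.get() && scheduled.compareAndSet(false, true)) {
      workers.submit(() -> {
        pending.set(false);
        try {
          stage.run();
        } finally {
          scheduled.set(false);
          trySchedule();
        }
      });
    }
  }
}
```

   With something along these lines, a query that's waiting on upstream data costs zero threads; the pool only ever sees tasks that have data in front of them.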
   
   Ideally, I'd suggest something like:
   
   
![image](https://user-images.githubusercontent.com/3172405/196330399-4edaf31c-8de7-45f0-8a72-3d5814ffd440.png)
   
   
   This gives us a few improvements over the existing architecture.
   1. the obvious advantage is that it's reactive and will only schedule execution when there's work to be done (we don't block threads/worker pools unless work is actually being done)
   2. it allows us to control memory and thread utilization across multiple queries that are running on a single node, while [applying backpressure](https://github.com/grpc/grpc-java/blob/master/examples/src/main/java/io/grpc/examples/manualflowcontrol/ManualFlowControlServer.java) on servers if we aren't able to keep up (e.g. we're doing some complex aggregation on the intermediate stage nodes) - see the sketch after this list
   3. I'm not totally sure this is possible with gRPC the way it's written today, but if we could figure out a way to reuse byte buffers from incoming `MailboxContent` we could likely be more efficient in terms of GC
   4. Debuggability and observability are pretty awesome if this is done correctly, because you can easily introspect each part of the pipeline (data was received, a task was scheduled, a task finished processing the data that was passed to it, a task is idle)
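
   On the backpressure point (2 above), here's a hedged sketch of what manual flow control could look like on the receiving side, loosely following the linked `ManualFlowControlServer` example. The class and the `intake` callback are assumptions I made up for illustration; only the `ServerCallStreamObserver` calls (`disableAutoRequest`, `request`) are real gRPC Java API.

```java
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;
import java.util.function.BiConsumer;

// Sketch only: in our case ReqT would be MailboxContent and RespT MailboxStatus,
// but this class is hypothetical. The flow-control calls are real gRPC Java API.
final class BackpressuredReceiveObserver<ReqT, RespT> implements StreamObserver<ReqT> {
  private final ServerCallStreamObserver<RespT> responseObserver;
  private final BiConsumer<ReqT, Runnable> intake; // hands a block to the stage scheduler

  BackpressuredReceiveObserver(StreamObserver<RespT> responseObserver,
                               BiConsumer<ReqT, Runnable> intake) {
    this.responseObserver = (ServerCallStreamObserver<RespT>) responseObserver;
    this.intake = intake;
    this.responseObserver.disableAutoRequest(); // take inbound flow control away from gRPC
    this.responseObserver.request(1);           // ask the sender for exactly one message
  }

  @Override
  public void onNext(ReqT block) {
    // Request the next block only after this one has been consumed, so a slow
    // intermediate stage pushes back on the sender over the wire instead of
    // piling blocks up in per-query queues.
    intake.accept(block, () -> responseObserver.request(1));
  }

  @Override
  public void onError(Throwable t) {
    // error/cancellation propagation elided in this sketch
  }

  @Override
  public void onCompleted() {
    responseObserver.onCompleted();
  }
}
```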
   
   We could certainly do this incrementally - commenting on your specific considerations:
   
   > consolidate the multiple `ArrayBlockingQueue`s in each observer into a single shared queue, so the mailbox receive operator waits for signals from the blocking queue offer
   
   I think this is exactly right - I'd suggest we even consolidate it across queries (and then use a simple in-memory scheduler to forward the right data to the right tasks; rough sketch below), but that's probably a bigger change.
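
   Roughly what I mean by that (again, every name here is hypothetical): one intake queue no matter how many queries are live, and a small in-memory dispatcher that forwards each block to whichever stage registered the target mailbox.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch of a single shared intake across all queries on a node.
final class SharedMailboxDispatcher<T> {
  private static final class Delivery<T> {
    final String mailboxId;
    final T block;
    Delivery(String mailboxId, T block) { this.mailboxId = mailboxId; this.block = block; }
  }

  private final BlockingQueue<Delivery<T>> intake = new LinkedBlockingQueue<>();
  private final Map<String, Consumer<T>> stages = new ConcurrentHashMap<>();

  /** Each stage registers the callback that should fire when its mailbox gets data. */
  void register(String mailboxId, Consumer<T> onData) {
    stages.put(mailboxId, onData);
  }

  /** Called from the gRPC observers; one queue no matter how many queries are running. */
  void offer(String mailboxId, T block) {
    intake.add(new Delivery<>(mailboxId, block));
  }

  /** Single dispatcher loop: route each block to the stage that owns the mailbox. */
  void dispatchLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      Delivery<T> d = intake.take();
      Consumer<T> target = stages.get(d.mailboxId);
      if (target != null) {
        target.accept(d.block); // e.g. buffer it and poke the stage's scheduler
      }
    }
  }
}
```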
   
   > create a signaling mechanism to inform the mailbox receive operator thread to start reading directly from the notifying observer
   
   Also +1, but the subtle difference from what I'm suggesting here is that it wouldn't signal the mailbox receive operator specifically, but rather the executor that schedules the receive operator in the first place. Basically, there would be no mailbox receive operator thread allocated unless there's work to be done (rough wiring sketch below).
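
   Under the same assumptions as the sketches above (reusing the hypothetical `StageRunner` and `SharedMailboxDispatcher`), the wiring could look something like this: data arrival pokes the scheduler, and only then does the shared pool get a stage task to run.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;

// Hypothetical wiring of the earlier sketches: no receive-operator thread exists
// until data for this mailbox actually shows up.
final class StageWiring {
  static void wire(SharedMailboxDispatcher<byte[]> dispatcher, String mailboxId,
                   ExecutorService sharedWorkers) {
    Queue<byte[]> stageBuffer = new ConcurrentLinkedQueue<>();
    StageRunner runner = new StageRunner(sharedWorkers, () -> {
      byte[] block;
      while ((block = stageBuffer.poll()) != null) {
        // run the receive -> ... -> send operator chain over this block
      }
    });
    dispatcher.register(mailboxId, block -> {
      stageBuffer.add(block);     // buffer the data for the stage's operators
      runner.onDataAvailable();   // schedule the stage only now that there is work
    });
  }
}
```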
   
   Happy to start prototyping some of this :)

