siddharthteotia commented on issue #9615: URL: https://github.com/apache/pinot/issues/9615#issuecomment-1282532520
Maybe this is also a good time / opportunity to rethink things at the operator level. Here are some thoughts...

- Think of the operator chain as a pipeline: data can be pumped up / down, in / out of operators.
- Some operators can produce / output data (technically data blocks / record batches) as they consume input. Example: PROJECT.
- Some operators need to consume everything before they can produce / output data. Example: GROUP BY.

Operators have a state machine, something along the lines of:

- SETUP / INIT
- CAN_CONSUME
- CAN_CONSUME_FROM_LEFT
- CAN_CONSUME_FROM_RIGHT
- CAN_PRODUCE
- BLOCKED
- DONE

Operator abstraction with sub-interfaces:

- Producer (e.g. APIs produceData, outputData)
- SingleConsumer (e.g. APIs consumeData, noMoreToConsume). Can also be a Producer.
- DualConsumer (e.g. HashJoin). Can also be a Producer.

Pipe:

- An abstraction representing a collection of operators, to enable pumping data between operators.
- Provides a method pump().
- Works with the source / sink concept for upstream / downstream.
- pump() can return:
  - DONE: essentially also setting noMoreToConsume on the sink, because the upstream is done producing.
  - PUMPED / IN_PROGRESS: the upstream / source has an output batch for the downstream / sink to consume.
  - UPSTREAM_NOT_READY: the sink is CAN_CONSUME, but the source is not ready to produce.
  - DOWNSTREAM_NOT_READY: the sink itself is blocked, or is holding batches that it has yet to produce / output and thus can't consume more (yet) from upstream.

We set up the Pipeline during execution planning, when the physical operators are created.

> we need to make sure that the stage only runs when there's work to be done and that the threads that are doing that work are efficiently shared between potentially many concurrent queries

+1

- Yes, I think we need executors at the inner stage level to execute the pipelines inside the Stage.
- Each such executor executes a pipeline inside a single thread. It calls pump() on the pipeline (described above).
- It has the capability to be scheduled, rescheduled, yielded, etc.
- Through the state machine, we need to know when work is available.
- We need to wake up this executor when work becomes available for the pipeline.
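To make the state machine and sub-interface ideas concrete, here is a minimal, hypothetical Java sketch (none of these names are Pinot's actual API). The `CountAggregator` illustrates a blocking operator in the GROUP BY style: it stays in CAN_CONSUME until `noMoreToConsume()`, only then moves to CAN_PRODUCE, and transitions to DONE once it has emitted its result.

```java
import java.util.Collections;
import java.util.List;

public class OperatorSketch {
  // The operator state machine proposed above.
  public enum OperatorState {
    SETUP, CAN_CONSUME, CAN_CONSUME_FROM_LEFT, CAN_CONSUME_FROM_RIGHT,
    CAN_PRODUCE, BLOCKED, DONE
  }

  public interface Operator {
    OperatorState state();
  }

  // Operators that emit batches downstream (rows modeled as Object[]).
  public interface Producer extends Operator {
    List<Object[]> produceData();
  }

  // Operators with a single upstream (e.g. PROJECT); may also be a Producer.
  public interface SingleConsumer extends Operator {
    void consumeData(List<Object[]> batch);
    void noMoreToConsume();
  }

  // Operators with two upstreams (e.g. HashJoin); may also be a Producer.
  public interface DualConsumer extends Operator {
    void consumeLeft(List<Object[]> batch);
    void consumeRight(List<Object[]> batch);
    void noMoreToConsume();
  }

  // Blocking operator example: must consume everything before producing.
  public static class CountAggregator implements SingleConsumer, Producer {
    private OperatorState state = OperatorState.CAN_CONSUME;
    private long count = 0;

    public OperatorState state() { return state; }

    public void consumeData(List<Object[]> batch) { count += batch.size(); }

    // No more input: now (and only now) the aggregate can be produced.
    public void noMoreToConsume() { state = OperatorState.CAN_PRODUCE; }

    public List<Object[]> produceData() {
      state = OperatorState.DONE;
      return Collections.singletonList(new Object[] { count });
    }
  }
}
```

A DualConsumer like HashJoin would be a hybrid: CAN_CONSUME_FROM_RIGHT while building the hash table, then CAN_CONSUME_FROM_LEFT / CAN_PRODUCE while streaming probe-side batches.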
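The Pipe's pump() contract can likewise be sketched as a small state check over a source / sink pair. This is an illustrative shape only, under the assumption that a Pipe moves one batch per pump() call; the Source, Sink, QueueSource, and ListSink types here are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class PipeSketch {
  // The four pump() outcomes described above.
  public enum PumpStatus { DONE, PUMPED, UPSTREAM_NOT_READY, DOWNSTREAM_NOT_READY }

  public interface Source {
    boolean isDone();
    boolean canProduce();
    Object produce();
  }

  public interface Sink {
    boolean canConsume();
    void consume(Object batch);
    void noMoreToConsume();
  }

  // Pumps at most one batch from source (upstream) to sink (downstream) per call.
  public static class Pipe {
    private final Source source;
    private final Sink sink;

    public Pipe(Source source, Sink sink) { this.source = source; this.sink = sink; }

    public PumpStatus pump() {
      if (source.isDone()) {
        sink.noMoreToConsume();                  // upstream is done producing
        return PumpStatus.DONE;
      }
      if (!sink.canConsume()) {
        return PumpStatus.DOWNSTREAM_NOT_READY;  // sink blocked / holding batches
      }
      if (!source.canProduce()) {
        return PumpStatus.UPSTREAM_NOT_READY;    // sink ready, source is not
      }
      sink.consume(source.produce());
      return PumpStatus.PUMPED;
    }
  }

  // Demo source over a fixed queue of batches.
  public static class QueueSource implements Source {
    private final ArrayDeque<Object> batches;
    public QueueSource(Collection<?> c) { batches = new ArrayDeque<>(c); }
    public boolean isDone() { return batches.isEmpty(); }
    public boolean canProduce() { return !batches.isEmpty(); }
    public Object produce() { return batches.poll(); }
  }

  // Demo sink that collects batches and records end-of-stream.
  public static class ListSink implements Sink {
    public final List<Object> received = new ArrayList<>();
    public boolean finished = false;
    public boolean canConsume() { return !finished; }
    public void consume(Object batch) { received.add(batch); }
    public void noMoreToConsume() { finished = true; }
  }
}
```

With a chain of operators, the Pipe (or the executor driving it) would pick which adjacent source / sink pair to pump based on the operators' states.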
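The executor loop could look roughly like the following: a single thread repeatedly calls pump(), keeps going while progress is made, parks when neither side is ready, and is woken via signalWork() when upstream or downstream state changes. Again a hypothetical sketch, not Pinot code; a real scheduler would yield the thread back to a shared pool instead of blocking it.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical single-threaded executor for one pipeline.
public class PipelineExecutor implements Runnable {
  public interface Pipeline {
    Status pump();
    enum Status { DONE, PUMPED, UPSTREAM_NOT_READY, DOWNSTREAM_NOT_READY }
  }

  private final Pipeline pipeline;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition workAvailable = lock.newCondition();
  private boolean hasWork = true;  // start optimistically: try pumping once

  public PipelineExecutor(Pipeline pipeline) { this.pipeline = pipeline; }

  // Called by upstream / downstream when their state changes,
  // i.e. "wake up this executor, work is available".
  public void signalWork() {
    lock.lock();
    try {
      hasWork = true;
      workAvailable.signal();
    } finally {
      lock.unlock();
    }
  }

  public void run() {
    while (true) {
      Pipeline.Status status = pipeline.pump();
      if (status == Pipeline.Status.DONE) {
        return;    // pipeline finished; release the thread
      }
      if (status == Pipeline.Status.PUMPED) {
        continue;  // made progress; keep pumping
      }
      // UPSTREAM_NOT_READY / DOWNSTREAM_NOT_READY: park until signaled.
      lock.lock();
      try {
        while (!hasWork) {
          workAvailable.awaitUninterruptibly();
        }
        hasWork = false;  // consume the wakeup, then retry pump()
      } finally {
        lock.unlock();
      }
    }
  }
}
```

Rescheduling / yielding would replace the blocking await here: instead of parking the thread, the executor would return it to the pool and re-enqueue itself when signalWork() fires.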
