siddharthteotia commented on issue #9615: URL: https://github.com/apache/pinot/issues/9615#issuecomment-1282532520
Maybe this is also a good time / opportunity to rethink things at the operator level. Here are some thoughts...

- Think of the operator chain as a pipeline: data can be pumped up / down, in / out of operators.
- Some operators can produce / output data (technically data blocks / record batches) as they consume input. Example: PROJECT.
- Some operators need to consume everything before they can produce / output data. Example: GROUP BY.

Operators have a state machine, something along the lines of:

- SETUP / INIT
- CAN_CONSUME
- CAN_CONSUME_FROM_LEFT
- CAN_CONSUME_FROM_RIGHT
- CAN_PRODUCE
- BLOCKED
- DONE

Operator abstraction with sub-interfaces:

- Producer (e.g. APIs produceData, outputData)
- SingleConsumer (e.g. APIs consumeData, noMoreToConsume). Can also be a Producer.
- DualConsumer (e.g. HashJoin). Can also be a Producer.

Pipe:

- An abstraction representing a collection of operators, to enable pumping data between operators.
- Provides a method pump().
- Works with the source / sink concept for upstream / downstream.
- pump() can return:
  - DONE: essentially also setting noMoreToConsume on the sink, because the upstream is done producing.
  - PUMPED / IN_PROGRESS: the upstream / source has an output batch for the downstream / sink to consume.
  - UPSTREAM_NOT_READY: the sink is CAN_CONSUME, but the source is not ready to produce.
  - DOWNSTREAM_NOT_READY: the sink itself is blocked, or is holding batches that it has yet to produce / output and thus can't consume more (yet) from upstream.

We set up the Pipeline during execution planning, when the physical operators are created.

> we need to make sure that the stage only runs when there's work to be done and that the threads that are doing that work are efficiently shared between potentially many concurrent queries

+1

- Yes, I think we need executors at the inner stage level to execute the pipelines inside the Stage.
- Each such executor executes a pipeline inside a single thread. It calls pump() on the pipeline (described above).
- It has the capability to be scheduled, rescheduled, yielded, etc.
- Through the state machine, we need to know when work is available.
- We need to wake up this executor when work becomes available for the pipeline.
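To make the state machine and sub-interface ideas concrete, here is a minimal, hypothetical Java sketch (none of these names are Pinot's actual API). The `CountAggregator` illustrates a blocking operator in the GROUP BY style: it stays in CAN_CONSUME until `noMoreToConsume()`, only then moves to CAN_PRODUCE, and transitions to DONE once it has emitted its result.

```java
import java.util.Collections;
import java.util.List;

public class OperatorSketch {
  // The operator state machine proposed above.
  public enum OperatorState {
    SETUP, CAN_CONSUME, CAN_CONSUME_FROM_LEFT, CAN_CONSUME_FROM_RIGHT,
    CAN_PRODUCE, BLOCKED, DONE
  }

  public interface Operator {
    OperatorState state();
  }

  // Operators that emit batches downstream (rows modeled as Object[]).
  public interface Producer extends Operator {
    List<Object[]> produceData();
  }

  // Operators with a single upstream (e.g. PROJECT); may also be a Producer.
  public interface SingleConsumer extends Operator {
    void consumeData(List<Object[]> batch);
    void noMoreToConsume();
  }

  // Operators with two upstreams (e.g. HashJoin); may also be a Producer.
  public interface DualConsumer extends Operator {
    void consumeLeft(List<Object[]> batch);
    void consumeRight(List<Object[]> batch);
    void noMoreToConsume();
  }

  // Blocking operator example: must consume everything before producing.
  public static class CountAggregator implements SingleConsumer, Producer {
    private OperatorState state = OperatorState.CAN_CONSUME;
    private long count = 0;

    public OperatorState state() { return state; }

    public void consumeData(List<Object[]> batch) { count += batch.size(); }

    // No more input: now (and only now) the aggregate can be produced.
    public void noMoreToConsume() { state = OperatorState.CAN_PRODUCE; }

    public List<Object[]> produceData() {
      state = OperatorState.DONE;
      return Collections.singletonList(new Object[] { count });
    }
  }
}
```

A DualConsumer like HashJoin would be a hybrid: CAN_CONSUME_FROM_RIGHT while building the hash table, then CAN_CONSUME_FROM_LEFT / CAN_PRODUCE while streaming probe-side batches.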
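The Pipe's pump() contract can likewise be sketched as a small state check over a source / sink pair. This is an illustrative shape only, under the assumption that a Pipe moves one batch per pump() call; the Source, Sink, QueueSource, and ListSink types here are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class PipeSketch {
  // The four pump() outcomes described above.
  public enum PumpStatus { DONE, PUMPED, UPSTREAM_NOT_READY, DOWNSTREAM_NOT_READY }

  public interface Source {
    boolean isDone();
    boolean canProduce();
    Object produce();
  }

  public interface Sink {
    boolean canConsume();
    void consume(Object batch);
    void noMoreToConsume();
  }

  // Pumps at most one batch from source (upstream) to sink (downstream) per call.
  public static class Pipe {
    private final Source source;
    private final Sink sink;

    public Pipe(Source source, Sink sink) { this.source = source; this.sink = sink; }

    public PumpStatus pump() {
      if (source.isDone()) {
        sink.noMoreToConsume();                  // upstream is done producing
        return PumpStatus.DONE;
      }
      if (!sink.canConsume()) {
        return PumpStatus.DOWNSTREAM_NOT_READY;  // sink blocked / holding batches
      }
      if (!source.canProduce()) {
        return PumpStatus.UPSTREAM_NOT_READY;    // sink ready, source is not
      }
      sink.consume(source.produce());
      return PumpStatus.PUMPED;
    }
  }

  // Demo source over a fixed queue of batches.
  public static class QueueSource implements Source {
    private final ArrayDeque<Object> batches;
    public QueueSource(Collection<?> c) { batches = new ArrayDeque<>(c); }
    public boolean isDone() { return batches.isEmpty(); }
    public boolean canProduce() { return !batches.isEmpty(); }
    public Object produce() { return batches.poll(); }
  }

  // Demo sink that collects batches and records end-of-stream.
  public static class ListSink implements Sink {
    public final List<Object> received = new ArrayList<>();
    public boolean finished = false;
    public boolean canConsume() { return !finished; }
    public void consume(Object batch) { received.add(batch); }
    public void noMoreToConsume() { finished = true; }
  }
}
```

With a chain of operators, the Pipe (or the executor driving it) would pick which adjacent source / sink pair to pump based on the operators' states.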
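The executor loop could look roughly like the following: a single thread repeatedly calls pump(), keeps going while progress is made, parks when neither side is ready, and is woken via signalWork() when upstream or downstream state changes. Again a hypothetical sketch, not Pinot code; a real scheduler would yield the thread back to a shared pool instead of blocking it.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical single-threaded executor for one pipeline.
public class PipelineExecutor implements Runnable {
  public interface Pipeline {
    Status pump();
    enum Status { DONE, PUMPED, UPSTREAM_NOT_READY, DOWNSTREAM_NOT_READY }
  }

  private final Pipeline pipeline;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition workAvailable = lock.newCondition();
  private boolean hasWork = true;  // start optimistically: try pumping once

  public PipelineExecutor(Pipeline pipeline) { this.pipeline = pipeline; }

  // Called by upstream / downstream when their state changes,
  // i.e. "wake up this executor, work is available".
  public void signalWork() {
    lock.lock();
    try {
      hasWork = true;
      workAvailable.signal();
    } finally {
      lock.unlock();
    }
  }

  public void run() {
    while (true) {
      Pipeline.Status status = pipeline.pump();
      if (status == Pipeline.Status.DONE) {
        return;    // pipeline finished; release the thread
      }
      if (status == Pipeline.Status.PUMPED) {
        continue;  // made progress; keep pumping
      }
      // UPSTREAM_NOT_READY / DOWNSTREAM_NOT_READY: park until signaled.
      lock.lock();
      try {
        while (!hasWork) {
          workAvailable.awaitUninterruptibly();
        }
        hasWork = false;  // consume the wakeup, then retry pump()
      } finally {
        lock.unlock();
      }
    }
  }
}
```

Rescheduling / yielding would replace the blocking await here: instead of parking the thread, the executor would return it to the pool and re-enqueue itself when signalWork() fires.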
