zhuqi-lucas commented on PR #16322: URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2961564872
> > Previously, as soon as one stream returned Pending, the merge would short-circuit and return Pending, minimizing work per cycle. With the new approach, we now poll all N streams once before giving up, which can add extra CPU cost, especially when you have high parallelism
>
> @zhuqi-lucas this is a really interesting topic and tricky tradeoff. The previous code would indeed do one poll per cycle and then do a self-wake yield. The new code polls each stream once and then yields without self-wake. In terms of the amount of work done there is no difference, not any extra CPU cost. On the contrary, the self-wake yield is removal of pure waste. Unless I'm missing something, total elapsed time should be less.
>
> But the tradeoff is that we're going to extend the elapsed time per cycle. As a consequence it may take a bit longer to yield to the caller, and then the cancellation problem rears its ugly head again.
>
> If this actually matters or not is entirely dependent on the streams being polled. SortPreservingMergeExec for instance sets things up to use `RecordBatchReceiverStream` instances as children. Polling those repeatedly until the first record batch arrives is quite wasteful because you're really just spinning in place. Checking each receiver in a loop is going to be super quick, and yielding in between every partition is not useful at all. If the child streams are blocking though, then it's a different matter. You probably don't want to actually drive the progress of each stream synchronously in a loop.
>
> So... it's a complicated balancing act. Would love to hear how others look at this problem. PR #16319 ties into this. It's an experiment I'm doing to see if we can avoid exposing the potentially blocking portion of streams to the caller so that the problem described above kind of disappears. It's not yet clear if this can be achieved without a performance penalty.
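The difference between the two strategies quoted above can be sketched with a simplified, synchronous model. Note that `ChildStream`, `PollResult`, and the poll counters below are hypothetical stand-ins for illustration only, not DataFusion's actual `Stream`/`Waker` machinery:

```rust
// Simplified stand-in for std::task::Poll on a child stream.
#[derive(Debug, PartialEq)]
enum PollResult {
    Ready(u64), // a record batch (modelled as a number) is available
    Pending,    // no data yet
}

// Hypothetical child stream that becomes ready on its `ready_after`-th poll.
struct ChildStream {
    ready_after: u32,
    polls: u32,
}

impl ChildStream {
    fn new(ready_after: u32) -> Self {
        Self { ready_after, polls: 0 }
    }

    fn poll_next(&mut self) -> PollResult {
        self.polls += 1;
        if self.polls >= self.ready_after {
            PollResult::Ready(u64::from(self.polls))
        } else {
            PollResult::Pending
        }
    }
}

// Old behaviour: short-circuit on the first Pending child and yield with a
// self-wake, so the runtime re-polls the merge immediately.
// Returns (all_children_ready, children_polled_this_cycle).
fn poll_cycle_short_circuit(children: &mut [ChildStream]) -> (bool, usize) {
    for (i, child) in children.iter_mut().enumerate() {
        if child.poll_next() == PollResult::Pending {
            return (false, i + 1); // yield now; the self-wake re-schedules us
        }
    }
    (true, children.len())
}

// New behaviour: give every child one poll before yielding; no self-wake,
// so the merge is only re-polled once a child's waker actually fires.
fn poll_cycle_all(children: &mut [ChildStream]) -> (bool, usize) {
    let mut all_ready = true;
    for child in children.iter_mut() {
        if child.poll_next() == PollResult::Pending {
            all_ready = false; // remember, but keep polling the rest
        }
    }
    (all_ready, children.len())
}

fn main() {
    // First child needs two polls before it is ready, second is ready at once.
    let mut a = vec![ChildStream::new(2), ChildStream::new(1)];
    // Short-circuit stops at the first Pending child: 1 poll this cycle.
    assert_eq!(poll_cycle_short_circuit(&mut a), (false, 1));

    let mut b = vec![ChildStream::new(2), ChildStream::new(1)];
    // Poll-all touches both children before yielding: 2 polls this cycle.
    assert_eq!(poll_cycle_all(&mut b), (false, 2));
    // On the next (waker-driven) cycle every child is ready.
    assert_eq!(poll_cycle_all(&mut b), (true, 2));
}
```

The toy model shows the tradeoff in miniature: per-cycle cost grows from 1 to N polls, but the self-wake busy loop disappears.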
Thank you @pepijnve, a possible trade-off solution may be a **batch polling** strategy.

An approach that sits between the two extremes is batch polling. Instead of polling all streams every time, you divide the N child streams into several groups (e.g., 3 groups). On each poll cycle you poll only one group of streams; on the next cycle you move to the next group, and so on. This helps reduce the per-cycle polling cost while still making steady progress across all streams over time, especially in high-parallelism scenarios where polling all streams on every wake-up is unnecessarily expensive.

And I believe our TPC-H benchmarks are not enough; we may need to mock those cases... But it's really hard.
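The grouped rotation described above could be sketched as follows. The `BatchPoller` type and its method names are hypothetical, purely to show the bookkeeping; a real merge would also need waker handling so that children in un-polled groups can still wake the operator:

```rust
// Hypothetical sketch of the batch-polling idea: divide the N child streams
// into fixed-size groups and poll only one group per wake cycle, rotating
// through the groups. Names are illustrative, not DataFusion APIs.
struct BatchPoller {
    group_size: usize,
    next_group: usize, // which group to poll on the next cycle
}

impl BatchPoller {
    fn new(group_size: usize) -> Self {
        assert!(group_size > 0);
        Self { group_size, next_group: 0 }
    }

    // Returns the range of child indices to poll this cycle, then advances
    // to the next group, wrapping around after the last one.
    fn next_batch(&mut self, num_children: usize) -> std::ops::Range<usize> {
        // Ceiling division: the last group may be smaller than group_size.
        let num_groups = (num_children + self.group_size - 1) / self.group_size;
        let start = self.next_group * self.group_size;
        let end = (start + self.group_size).min(num_children);
        self.next_group = (self.next_group + 1) % num_groups;
        start..end
    }
}

fn main() {
    // 8 child streams polled in groups of 3: cycles cover 0..3, 3..6, 6..8,
    // then wrap back to the first group.
    let mut poller = BatchPoller::new(3);
    assert_eq!(poller.next_batch(8), 0..3);
    assert_eq!(poller.next_batch(8), 3..6);
    assert_eq!(poller.next_batch(8), 6..8);
    assert_eq!(poller.next_batch(8), 0..3);
}
```

One consequence of this design is that the per-cycle cost is bounded by the group size instead of N, while the worst-case delay before a given child is polled again grows with the number of groups, so group size becomes a tuning knob between cycle latency and progress fairness.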