2010YOUY01 commented on issue #22946: URL: https://github.com/apache/datafusion/issues/22946#issuecomment-4713778114
> I am also open to exploring whether the spill/streaming work should be integrated with `BoundedWindowAggExec`, especially for bounded frames as mentioned above. My hesitation is that `BoundedWindowAggExec` already has a specialized in-memory state/pruning model, so disk-backed state there likely deserves a separate focused design rather than being mixed into the initial spill PR.

Here are some quick ideas. I may not have explained everything clearly yet, but I’ll put together an epic issue for improving window functions to better explain the direction.

I think `BoundedWindowAggExec` should eventually be deprecated in favor of a new streaming implementation. My concern is that it assumes the input may not be fully ordered by group key and partition key, and that assumption gets in the way of a more efficient implementation.

So my preference would be to move directly toward a better streaming implementation, rather than adding an intermediate spilling-based step.

The workloads that a streaming approach cannot fully solve are:
- a single partition that does not fit in memory
- and a window frame that moves unpredictably from row to row

Those cases likely need an LRU-like algorithm, but I don’t think that should be the current priority. Since the window operator is still fairly basic at the moment, I think we should make the in-memory cases better first.
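As a rough illustration of why a bounded frame over fully ordered input can be streamed with bounded memory, here is a minimal sketch. It is not DataFusion code: `SlidingSum` and `sliding_sum` are hypothetical names, and the sketch assumes a single partition, a `ROWS BETWEEN n PRECEDING AND CURRENT ROW` frame, and a plain `SUM` over `i64` values.

```rust
use std::collections::VecDeque;

/// Hypothetical state for a running SUM over
/// `ROWS BETWEEN preceding PRECEDING AND CURRENT ROW`.
/// Assumes rows of one partition arrive already in frame order.
struct SlidingSum {
    preceding: usize,
    window: VecDeque<i64>, // holds at most `preceding + 1` rows after each push
    sum: i64,
}

impl SlidingSum {
    fn new(preceding: usize) -> Self {
        Self { preceding, window: VecDeque::new(), sum: 0 }
    }

    /// Consume one input row and return the aggregate for that row.
    fn push(&mut self, value: i64) -> i64 {
        self.window.push_back(value);
        self.sum += value;
        // Evict the row that just fell out of the frame, if any.
        if self.window.len() > self.preceding + 1 {
            if let Some(old) = self.window.pop_front() {
                self.sum -= old;
            }
        }
        self.sum
    }
}

/// Stream over the rows, emitting one output value per input row.
fn sliding_sum(rows: impl IntoIterator<Item = i64>, preceding: usize) -> Vec<i64> {
    let mut state = SlidingSum::new(preceding);
    rows.into_iter().map(|v| state.push(v)).collect()
}
```

Calling `sliding_sum(vec![1, 2, 3, 4, 5], 2)` keeps at most three rows buffered at any time, regardless of partition size. A frame whose bounds move unpredictably from row to row would not allow this simple front eviction.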
