Dandandan commented on PR #23026: URL: https://github.com/apache/datafusion/pull/23026#issuecomment-4847491222
> In terms of evaluating window functions with more memory efficiency, there is this paper which is the new hotness that describes segment trees > > * https://www.vldb.org/pvldb/vol8/p1058-leis.pdf I had a look at segment trees but I think they might not be _that_ great if for performance optimization of existing functionality: * Segment Tree is actually slower algorithmically and in practice than _removable cumulative_ (which DataFusion uses/supports, i.e. `retract_batch`) on some functions (e.g. SUM), see Table 1 and Figure 10: <img width="962" height="176" alt="image" src="https://github.com/user-attachments/assets/3888d868-28de-41de-8f1e-729cb1c5decc" /> <img width="514" height="364" alt="image" src="https://github.com/user-attachments/assets/7cf6b5ba-f345-453a-9328-c63b4499adc3" /> * The paper also agrees segment trees has overhead and other approaches should be taken in other cases * They are mainly better for _variable frame bounds_ (e.g. `between column_1 and column_2 preceding` ), which is something DataFusion does not support yet (although arguably adding it might require segment trees or something similar to add that). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
