alamb commented on issue #23197:
URL: https://github.com/apache/datafusion/issues/23197#issuecomment-4835392373

   > Here is my guess on why such design decision was made before: because the 
original window aggregation implementation was fairly naive: it buffered all 
input before continuing execution. As a result, a downstream implementation was 
upstreamed to support streaming and out-of-order input, where assuming fully 
sorted input was not possible for their use cases. Given that goal, the 
additional complexity made sense because it improved memory efficiency.
   
   Yes I think that is mostly accurate (I think @akurmustafa / @mustafasrepo  
did a lot of the initial work)
   
   I think `BoundedWindowExec` was designed to support streaming sql engines 
(e.g. where the ExecutionPlan was started and the incrementally produced 
results as data was incrementally pushed through it)
   
    > However, DataFusion's primary use case is still batched analytics. Under 
that assumption, we can require the window operator to consume input that has 
already been repartitioned and sorted by the window partition/order keys. This 
makes the implementation both simpler and faster, and we can achieve the same 
memory efficiency given bounded window frames.
   
   I agree it makes the implementation simpler -- I suspect a fully specialized 
operator could be made faster than spilling / resorting, but I also think we 
could work on that after we have exhausted all options for making 
BoundedWindowAggExec faster


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to