[GitHub] [arrow-datafusion] milenkovicm commented on issue #1570: Memory Limited GroupBy (Externalized / Spill)

via GitHub Wed, 26 Apr 2023 02:39:11 -0700


milenkovicm commented on issue #1570:
URL: https://github.com/apache/arrow-datafusion/issues/1570#issuecomment-1523105149


   I'm not sure I see many benefits in making it serializable, so I agree
   with @crepererum.
   This discussion would make more sense if we knew more about your
   implementation.
   
   IMHO, aggregation should start with a hash map. We can assume that there
   will be no spill; if that assumption turns out to be wrong, we pay the
   penalty of having to sort the hash map's contents before spilling.
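
   A minimal sketch of that first phase, using only the Rust standard library. The names `HashAggregator` and `max_groups`, the byte-string keys, and the running-sum accumulator are assumptions made for illustration; a group-count limit stands in for a real memory budget, and none of this is DataFusion's actual API.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory aggregation state: group key -> running sum.
pub struct HashAggregator {
    groups: HashMap<Vec<u8>, i64>,
    /// Stand-in for a memory budget: the maximum number of groups kept in memory.
    max_groups: usize,
}

impl HashAggregator {
    pub fn new(max_groups: usize) -> Self {
        Self { groups: HashMap::new(), max_groups }
    }

    /// Folds one input value into its group. If the budget is exceeded,
    /// the whole state is drained and returned as a key-sorted run that
    /// the caller is expected to spill.
    pub fn update(&mut self, key: &[u8], value: i64) -> Option<Vec<(Vec<u8>, i64)>> {
        *self.groups.entry(key.to_vec()).or_insert(0) += value;
        if self.groups.len() > self.max_groups {
            Some(self.drain_sorted())
        } else {
            None
        }
    }

    /// The "penalty of being wrong": a hash map has no useful order,
    /// so its entries must be sorted by key before they are spilled.
    pub fn drain_sorted(&mut self) -> Vec<(Vec<u8>, i64)> {
        let mut run: Vec<(Vec<u8>, i64)> = self.groups.drain().collect();
        run.sort_unstable_by(|a, b| a.0.cmp(&b.0));
        run
    }
}
```

   In the common case `update` never returns a run, so the sort cost is paid only when the no-spill assumption fails.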
   
   Once we have spilled to disk, I'd argue it makes more sense to switch
   from a hash map to a B-tree, since we will need to merge its contents with
   the spill. A B-tree is slower than a hash map, but in my experience it is
   still a bit faster than sorting a hash map.
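
   A sketch of why an ordered map helps after the first spill, using `std::collections::BTreeMap`. The function name `merge_with_spill`, the byte-string keys, and the sum accumulator are assumptions for illustration. Because a `BTreeMap` iterates in key order, it can be merged with an already sorted spill run in a single pass, with no sort step.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// Merges key-ordered in-memory state with one key-sorted spill run,
/// summing the values of keys that appear on both sides.
pub fn merge_with_spill(
    in_memory: &BTreeMap<Vec<u8>, i64>,
    spilled: &[(Vec<u8>, i64)],
) -> Vec<(Vec<u8>, i64)> {
    let mut out = Vec::with_capacity(in_memory.len() + spilled.len());
    let mut mem = in_memory.iter().peekable();
    let mut disk = spilled.iter().peekable();

    loop {
        // Decide which side holds the smaller next key.
        let ord = match (mem.peek(), disk.peek()) {
            (Some(m), Some(d)) => m.0.cmp(&d.0),
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => break,
        };
        match ord {
            Ordering::Less => {
                let (k, v) = mem.next().unwrap();
                out.push((k.clone(), *v));
            }
            Ordering::Greater => {
                let (k, v) = disk.next().unwrap();
                out.push((k.clone(), *v));
            }
            Ordering::Equal => {
                // Same group on both sides: combine the partial aggregates.
                let (k, mv) = mem.next().unwrap();
                let (_, dv) = disk.next().unwrap();
                out.push((k.clone(), mv + dv));
            }
        }
    }
    out
}
```

   The `unwrap` calls cannot fail here, because each one follows a `peek` that already confirmed the element exists.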
   
   Spilling can be implemented using a two-column Parquet file (key: blob,
   value: blob).
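
   To illustrate the (key: blob, value: blob) record shape without pulling in a Parquet dependency, the sketch below swaps in a plain length-prefixed binary stream. It is explicitly not Parquet. The function names `write_record` and `read_record` and the 4-byte little-endian length prefix are assumptions made for illustration. The point is only that both the group key and the accumulator state travel as opaque byte strings.

```rust
use std::io::{self, Read, Write};

/// Writes one spill record as two length-prefixed blobs: key, then value.
pub fn write_record<W: Write>(out: &mut W, key: &[u8], value: &[u8]) -> io::Result<()> {
    for blob in [key, value] {
        out.write_all(&(blob.len() as u32).to_le_bytes())?;
        out.write_all(blob)?;
    }
    Ok(())
}

/// Reads one blob, or returns `None` if the stream ends where a length
/// prefix would start. (A truncated prefix is also treated as end of stream.)
fn read_blob<R: Read>(input: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len = [0u8; 4];
    match input.read_exact(&mut len) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut blob = vec![0u8; u32::from_le_bytes(len) as usize];
    input.read_exact(&mut blob)?;
    Ok(Some(blob))
}

/// Reads the next (key, value) record, or `None` at end of stream.
pub fn read_record<R: Read>(input: &mut R) -> io::Result<Option<(Vec<u8>, Vec<u8>)>> {
    let key = match read_blob(input)? {
        Some(key) => key,
        None => return Ok(None),
    };
    let value = read_blob(input)?.ok_or_else(|| {
        io::Error::new(io::ErrorKind::UnexpectedEof, "record is missing its value blob")
    })?;
    Ok(Some((key, value)))
}
```

   Writing records in key order keeps the spill file directly mergeable when it is read back.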
   
   An implementation like this works quite well in my experience, especially
   since in most cases we won't trigger a spill.


