[GitHub] [arrow-datafusion] crepererum commented on issue #1570: Memory Limited GroupBy (Externalized / Spill)

via GitHub Tue, 25 Apr 2023 07:33:21 -0700


crepererum commented on issue #1570:
URL: 
https://github.com/apache/arrow-datafusion/issues/1570#issuecomment-1521895443


   > Can we make the `GroupState` and the Accumulator states serializable? 
With this approach, we do not need to do any sort when spilling data to disk. 
And when we read the data back, we can reconstruct our raw hash table quickly 
from the hash values and indexes. Because our hashmap is very lightweight, the 
hash value can be re-calculated from the grouping rows, or we can cache the 
hash value inside the `GroupState` to avoid re-calculating it.
   
   You still need disk spilling, no? Or where do you store the serialized 
state? Also, I guess that serialization may become a major bottleneck for some 
of the accumulators.
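
   To make the trade-off concrete, here is a minimal sketch of what the quoted 
proposal could look like. It is not DataFusion's actual code: the `GroupState` 
struct below is a hypothetical, simplified stand-in with a cached hash, a 
byte-encoded grouping row, and two fixed-size accumulator fields, and the byte 
layout is an assumption made up for illustration. The resulting buffer still 
has to be written somewhere (for example a spill file), and accumulators with 
large or variable-size state would be more expensive to encode than the two 
integers shown here.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified per-group state (NOT DataFusion's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct GroupState {
    pub hash: u64,    // cached hash of the grouping row
    pub key: Vec<u8>, // grouping row, already encoded as bytes
    pub sum: i64,     // state of a SUM-like accumulator
    pub count: u64,   // state of a COUNT-like accumulator
}

/// Append one state to `out` using an assumed little-endian layout:
/// hash (8) | key length (4) | key bytes | sum (8) | count (8).
pub fn serialize(state: &GroupState, out: &mut Vec<u8>) {
    out.extend_from_slice(&state.hash.to_le_bytes());
    out.extend_from_slice(&(state.key.len() as u32).to_le_bytes());
    out.extend_from_slice(&state.key);
    out.extend_from_slice(&state.sum.to_le_bytes());
    out.extend_from_slice(&state.count.to_le_bytes());
}

/// Split `n` bytes off the front of `buf`, or return None if too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let slice: &'a [u8] = *buf;
    if slice.len() < n {
        return None;
    }
    let (head, tail) = slice.split_at(n);
    *buf = tail;
    Some(head)
}

/// Read 8 bytes from the front of `buf`.
fn take8(buf: &mut &[u8]) -> Option<[u8; 8]> {
    let mut arr = [0u8; 8];
    arr.copy_from_slice(take(buf, 8)?);
    Some(arr)
}

/// Read 4 bytes from the front of `buf`.
fn take4(buf: &mut &[u8]) -> Option<[u8; 4]> {
    let mut arr = [0u8; 4];
    arr.copy_from_slice(take(buf, 4)?);
    Some(arr)
}

/// Decode every state in `buf`; returns None if the buffer is truncated.
pub fn deserialize(mut buf: &[u8]) -> Option<Vec<GroupState>> {
    let mut states = Vec::new();
    while !buf.is_empty() {
        let hash = u64::from_le_bytes(take8(&mut buf)?);
        let key_len = u32::from_le_bytes(take4(&mut buf)?) as usize;
        let key = take(&mut buf, key_len)?.to_vec();
        let sum = i64::from_le_bytes(take8(&mut buf)?);
        let count = u64::from_le_bytes(take8(&mut buf)?);
        states.push(GroupState { hash, key, sum, count });
    }
    Some(states)
}

/// Rebuild a lightweight index (cached hash -> positions in `states`)
/// without re-hashing the grouping rows themselves.
pub fn rebuild_index(states: &[GroupState]) -> HashMap<u64, Vec<usize>> {
    let mut index: HashMap<u64, Vec<usize>> = HashMap::new();
    for (i, s) in states.iter().enumerate() {
        index.entry(s.hash).or_default().push(i);
    }
    index
}
```

   In this sketch no sort is needed before writing, and the index is rebuilt 
from the cached hashes alone, but every group still pays one encode and one 
decode pass, which is where the serialization cost mentioned above would show 
up.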


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

