alamb commented on issue #1570:
URL: 
https://github.com/apache/arrow-datafusion/issues/1570#issuecomment-1525937016

   > Can we make the GroupState and the Accumulator states serializable?
   > With this approach, we do not need to do any sort when spilling data to
   > disk. And when we read the data back, we can reconstruct our raw hash table
   > quickly from the hash values and indexes, because our hashmap is very
   > lightweight; the hash value can be re-calculated from the grouping rows, or
   > we can cache the hash value inside the GroupState to avoid re-calculating it.
   
   @mingmwang when we serialize the groups to disk once the hash table is
full, we then need to read them back in again somehow and do the final
combination. My assumption is that we won't have the memory to read them all
back in at once, since we had to spill in the first place.
   
   If we sort the spilled data on the group keys, we can stream the data
from the different spill files in parallel, merge them, and then do a streaming
group by, which we will have sufficient memory to accomplish.
   
   > IMHO, aggregation should start with a hash map; we can assume that there is
   > not going to be a spill, and if we're wrong we pay the penalty of being
   > wrong, as we will have to sort the data before spilling.
   
   Yes, I agree with @milenkovicm - this approach will work well. It does have the
disadvantage of a performance "cliff" when a query goes from in-memory to
spilling (it doesn't degrade gracefully), but it is likely the fastest approach for
queries that have sufficient memory.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
