As Serge mentioned, we’ve implemented spill-to-disk for SetOps and Aggregates
at Salesforce. We were hitting OOMs often enough that this became a high
priority for us. However, our current spill implementation is based on dynahash
from 9.6, and we’re not happy with its performance (it was built primarily as
an OOM stop-gap, not for speed).

Because of this, we’ve spec’d out a new spill-to-disk design for hash-based 
aggregates (and later SetOps), and we plan to begin implementation very soon.  
Since this is an important fix for the community as well, we would be happy 
(and would prefer) to share our spill-to-disk implementation. 

Our new spill approach is very similar to Jeff’s and Heikki’s and is designed 
to use simplehash.h. It’s based on an algorithm called “HybridHash with 
Pre-Partitioning” found in [1]. It may later make sense to implement the 
“HashSort” Algorithm from [1] as well, which works better for highly skewed 
grouping keys. The optimizer could eventually choose between the two based on 
the stats available. We also like Heikki’s suggestions to use logtape.c to 
reduce disk usage and a trie-based approach to control the size of partitions 
dynamically.
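
To make the shape of that algorithm concrete, here’s a minimal standalone C
sketch of hybrid hashing with pre-partitioning. Everything in it is
illustrative (the toy hash, the fixed fan-out, COUNT(*)-style aggregation,
tmpfile() spill files, all names); the real implementation would sit on
simplehash.h, logtape.c, and the executor’s tuple machinery:

/*
 * hybrid_hash.c -- illustrative sketch only, not PostgreSQL internals.
 *
 * Pass 1 splits the input into NUM_PARTS partitions on the high hash
 * bits; partition 0 is aggregated in memory while the rest are written
 * to temp files.  Pass 2 then aggregates each spilled partition by
 * itself, so only one partition's table is ever resident.  A real
 * implementation would recurse if a partition still didn't fit.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_PARTS   4            /* fan-out chosen up front */
#define TABLE_SLOTS 1024         /* stand-in for the in-memory table */

typedef struct { int64_t key; int64_t count; int used; } Slot;

static uint64_t
hash_key(int64_t key)
{
    uint64_t h = (uint64_t) key * 0x9E3779B97F4A7C15ULL;   /* toy mixer */

    return h ^ (h >> 32);
}

/* COUNT(*)-style aggregation into an open-addressing table. */
static void
table_add(Slot *table, int64_t key)
{
    uint64_t i = hash_key(key) % TABLE_SLOTS;    /* low bits pick a slot */

    while (table[i].used && table[i].key != key)
        i = (i + 1) % TABLE_SLOTS;               /* linear probing */
    table[i].key = key;
    table[i].used = 1;
    table[i].count++;
}

static void
table_emit(const Slot *table)
{
    for (int i = 0; i < TABLE_SLOTS; i++)
        if (table[i].used)
            printf("key=%lld count=%lld\n",
                   (long long) table[i].key, (long long) table[i].count);
}

int
main(void)
{
    int64_t input[] = {1, 2, 7, 1, 9, 2, 7, 7, 42, 1, 9, 5};
    int     ninput = sizeof(input) / sizeof(input[0]);
    FILE   *spill[NUM_PARTS] = {0};
    Slot   *table = calloc(TABLE_SLOTS, sizeof(Slot));

    /* Pass 1: partition on the *high* hash bits (the table uses the low
     * bits, so the two stay independent). */
    for (int i = 0; i < ninput; i++)
    {
        int part = (int) ((hash_key(input[i]) >> 32) % NUM_PARTS);

        if (part == 0)
            table_add(table, input[i]);
        else
        {
            if (!spill[part])
                spill[part] = tmpfile();
            fwrite(&input[i], sizeof(int64_t), 1, spill[part]);
        }
    }
    table_emit(table);

    /* Pass 2: process each spilled partition on its own. */
    for (int part = 1; part < NUM_PARTS; part++)
    {
        int64_t key;

        if (!spill[part])
            continue;
        memset(table, 0, TABLE_SLOTS * sizeof(Slot));
        rewind(spill[part]);
        while (fread(&key, sizeof(int64_t), 1, spill[part]) == 1)
            table_add(table, key);
        table_emit(table);
        fclose(spill[part]);
    }
    free(table);
    return 0;
}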

We’ve also been grappling with how to handle the implementation challenges 
pointed out in this thread. These include:
        1) tracking memory usage
        2) choosing a smart eviction policy (which is touched on in [2])
        3) serializing opaque user-defined transition values when eviction
           is required

For 1), we plan to use our WithStats memory context, which Serge mentioned.
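
Roughly, WithStats behaves like a context that keeps a running byte count the
aggregate node can check before inserting a new group. A stripped-down
illustration of that idea (not the actual WithStats code; all names here are
placeholders):

#include <stdio.h>
#include <stdlib.h>

typedef struct { size_t bytes_used; size_t bytes_limit; } TrackedContext;
typedef struct { size_t size; } ChunkHeader;

/* Every allocation carries a small header so frees are accounted too. */
static void *
tracked_alloc(TrackedContext *cxt, size_t size)
{
    ChunkHeader *hdr = malloc(sizeof(ChunkHeader) + size);

    if (!hdr)
        return NULL;
    hdr->size = size;
    cxt->bytes_used += size;
    return hdr + 1;
}

static void
tracked_free(TrackedContext *cxt, void *ptr)
{
    ChunkHeader *hdr = (ChunkHeader *) ptr - 1;

    cxt->bytes_used -= hdr->size;
    free(hdr);
}

/* The aggregate node would consult this before adding a group. */
static int
must_spill(const TrackedContext *cxt)
{
    return cxt->bytes_used >= cxt->bytes_limit;   /* e.g. work_mem */
}

int
main(void)
{
    TrackedContext cxt = { 0, 4096 };   /* pretend the limit is 4kB */
    void *p = tracked_alloc(&cxt, 1024);

    printf("used=%zu spill=%d\n", cxt.bytes_used, must_spill(&cxt));
    tracked_free(&cxt, p);
    printf("used=%zu\n", cxt.bytes_used);
    return 0;
}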
For 2), we plan to start with a simple non-eviction policy, since we don’t
have the stats to do anything smarter (i.e., insert until we fill the hash
table, then spill until we finish processing the batch, with evictions
happening only if a group’s transition value grows too large).
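
Spelled out, the per-tuple decision is just the following (the names and the
per-group size cap are placeholders; the real inputs would come from the hash
table and the memory accounting sketched above):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_TRANS_SIZE (1024 * 1024)   /* arbitrary per-group cap */

typedef enum
{
    ACT_ADVANCE,       /* group resident: advance its transition value  */
    ACT_INSERT,        /* new group, table has room: insert it          */
    ACT_SPILL_TUPLE,   /* new group, table full: spill the raw tuple    */
    ACT_EVICT_GROUP    /* transition value too large: serialize it out  */
} Action;

static Action
policy_decide(bool group_exists, bool table_full, size_t trans_size)
{
    if (!group_exists)
        return table_full ? ACT_SPILL_TUPLE : ACT_INSERT;
    if (trans_size > MAX_TRANS_SIZE)
        return ACT_EVICT_GROUP;
    return ACT_ADVANCE;
}

int
main(void)
{
    printf("%d\n", policy_decide(false, false, 0));       /* ACT_INSERT      */
    printf("%d\n", policy_decide(false, true, 0));        /* ACT_SPILL_TUPLE */
    printf("%d\n", policy_decide(true, true, 2 << 20));   /* ACT_EVICT_GROUP */
    printf("%d\n", policy_decide(true, true, 64));        /* ACT_ADVANCE     */
    return 0;
}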
For 3), we don’t have a good solution yet. We could handle
serialization/deserialization ourselves for built-in types, and going forward
rely on users to provide serialize/deserialize functions for user-defined
aggregates.
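
For flat, fixed-size transition states the easy case really is easy: a raw
byte copy round-trips fine, as in the toy example below (the AvgState shape is
made up for illustration). The hard case is internal-typed states that hold
pointers; for those, something along the lines of the SERIALFUNC/DESERIALFUNC
pair that CREATE AGGREGATE already accepts for parallel aggregation in 9.6
seems like the natural hook, though we haven’t worked through the details.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef struct { double sum; int64_t count; } AvgState;   /* avg()-style */

static size_t
avg_serialize(const AvgState *state, char *buf)
{
    memcpy(buf, state, sizeof(*state));   /* flat state: raw copy works */
    return sizeof(*state);
}

static void
avg_deserialize(AvgState *state, const char *buf)
{
    memcpy(state, buf, sizeof(*state));
}

int
main(void)
{
    AvgState in = { 12.5, 3 }, out;
    char     buf[sizeof(AvgState)];

    avg_serialize(&in, buf);              /* e.g. written to a spill file */
    avg_deserialize(&out, buf);
    printf("sum=%g count=%lld\n", out.sum, (long long) out.count);
    return 0;
}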

But we’re open to suggestions :)

Regards,
David
Salesforce

[1] Revisiting Aggregation for Data Intensive Applications: A Performance Study
https://arxiv.org/pdf/1311.0059.pdf

[2] DB2 with BLU acceleration: so much more than just a column store
http://delivery.acm.org/10.1145/2540000/2536233/p1080-raman.pdf
