Re: Drill: Memory Spilling for the Hash Aggregate Operator

Boaz Ben-Zvi Mon, 16 Jan 2017 18:55:48 -0800

The design in the document has an undocumented assumption — allow the memory 
available for an operator to fluctuate during the operator’s lifetime.
For example, the Hash Agregate operator (HAG) may hit the memory limit at 
500MB, than later hit it again at 600MB, etc, and when reading a spilled 
partition back from disk the limit may be 400MB, etc.
(A more gentle scheme may allow memory to increase, never go down; e.g., other 
operators may "give back” some of their allocations during execution; 
e.g. HHJ after finishing the build phase — can yield the extra memory to 
another operator). 
This is a more sophisticated memory management design than what we have now — a 
simple fixed pre-allocation.

A second point is Hash Aggregation — which is a little different from HHJ. For 
example, spilling a partition X (in multiple spill iterations) ends up with 
500MB on disk, but only (a fixed) 400MB memory is available.
Does this mean that partition X would not fit into the memory ?  No, as X may 
contain many duplicate groups (groups were recreated after every spill), so as 
we read and aggregate X, its size may “shrink” enough for the 400MB.

Indeed the design was not trying to avoid “re-writing to disk”; however the 
assumption was that this would be a rare situation, and may only apply to a 
part of the data.
If the initial number of partitions chosen N is large enough, then spilling 
would happen only once per (some of the) data.

For HAG, the partitions should all be of similar sizes, so if N was chosen too 
small, they all would need a “secondary spilling” which would include 
re-writing all the data (that’s the analogous to “phase 3” ….)
HHJ is a little different (due to possible key duplicate rows) — some 
partitions may be bigger than others. So maybe only few HHJ partitions would be 
written twice.

The downside of the “sub-partition” model described in Goetz’ paper is that 
indeed a second write of all the data is saved, but all the data (mixed 
together) needs to be read (up to) n times (plus some CPU overhead to filter 
out the needed parts).
Read may be costly, due to seeks (e.g., when a partition was spilled in many 
iterations, and ended scattered across the disk.) 
For the (future) HHJ design — we could handle a “too big” partition using a 
“hash loop”; that is, read only a part of the inner partition, then scan the 
whole outer matching partition, then read the next inner part, and scan the 
whole outer partition again.
This is indeed costly, but not much different from the “sub-partitioning” 
scheme — we just read part of the inner based on size (instead of several 
parts, like 0.0, 0.1, and 0.2), and then the whole outer without filtering - 
the hash table probing would do that.

     Thanks for the suggestions and the link; I’ll go over Goetz’ paper again 
and look for more ideas.

          — Boaz

> On Jan 16, 2017, at 4:09 PM, Julian Hyde <[email protected]> wrote:
> 
> Does the data need to be written into a disk-friendly format when a partition 
> is selected to be written to disk? If you are careful in your choice of 
> format then it doesn’t need to be re-written. And in fact you can start with 
> the assumption that everything is going to disk.
> 
> One of the most beautiful innovations of the HHJ paper [1] is the 
> histogram-driven partitioning. Basically, during phase 1 you apply the phase 
> 2 hash function to assign rows to “sub-partitions”. Partition 0 would contain 
> sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain sub-partitions 1.1, 
> …, 1.n. The rows are all mixed together in each partition, but you know how 
> many rows (and bytes) are in each sub-partition. If partition 0 (or any 
> partition) ends up larger than memory then you are going to need a phase 3. 
> But you can enter phase 2 armed with some very useful knowledge. You know the 
> sizes of the sub-partitions and you can choose a hash function in phase 2 
> such that many of the partitions end up *just* smaller than memory.
> 
> The big problem with external sort and hash algorithms is the huge 
> performance hit when you require an extra phase. If you need 2 phases, HHJ 
> can convert that to 1.5 phases (by pulling smaller partitions back into 
> memory) and by optimizing the assignment of rows to partitions it can turn a 
> 3.1 phase query into a 2.9 phase query - a big win.
> 
> Julian
> 
> [1] 
> https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf
>  
> <https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf>
> 
> 
> 
>> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <[email protected]> wrote:
>> 
>> Sorry for no attachment (Apache mail rules) -- Here is a link to the 
>> document:
>> 
>> 
>> DrillSpillmemoryforHashAggregation.pdf - 
>> https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing
>> 
>> [https://lh3.googleusercontent.com/U9FNbWEBljT-HDRBE1-vhMnE4Ug5YFgutztvbys2UnTiVp-FQX6mzQ=w1200-h630-p]<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing>
>> 
>> DrillSpillmemoryforHashAggregation.pdf<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing>
>> drive.google.com
>> 
>> 
>> 
>>   -- Boaz
>> 
>> ________________________________
>> From: Julian Hyde <[email protected]>
>> Sent: Friday, January 13, 2017 11:00 PM
>> To: [email protected]
>> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator
>> 
>> The attachment didn't come through. I'm hoping that you settled on a 
>> "hybrid" hash algorithm that can write to disk, or write to memory, and the 
>> cost of discovering that is wrong is not too great. With Goetz Graefe's 
>> hybrid hash join (which can be easily adapted to hybrid hash aggregate) if 
>> the input ALMOST fits in memory you could process most of it in memory, then 
>> revisit the stuff you spilled to disk.
>> 
>>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <[email protected]> wrote:
>>> 
>>> Hi Drill developers,
>>> 
>>>    Attached is a document describing the design for memory spilling 
>>> implementation for the Hash Aggregate operator.
>>> 
>>>    Please send me any comments or questions,
>>> 
>>>       -- Boaz
>> 
>

Re: Drill: Memory Spilling for the Hash Aggregate Operator

Reply via email to