The design in the document has an undocumented assumption — allow the memory
available for an operator to fluctuate during the operator’s lifetime.
For example, the Hash Agregate operator (HAG) may hit the memory limit at
500MB, than later hit it again at 600MB, etc, and when reading a spilled
partition back from disk the limit may be 400MB, etc.
(A more gentle scheme may allow memory to increase, never go down; e.g., other
operators may "give back” some of their allocations during execution;
e.g. HHJ after finishing the build phase — can yield the extra memory to
another operator).
This is a more sophisticated memory management design than what we have now — a
simple fixed pre-allocation.
A second point is Hash Aggregation — which is a little different from HHJ. For
example, spilling a partition X (in multiple spill iterations) ends up with
500MB on disk, but only (a fixed) 400MB memory is available.
Does this mean that partition X would not fit into the memory ? No, as X may
contain many duplicate groups (groups were recreated after every spill), so as
we read and aggregate X, its size may “shrink” enough for the 400MB.
Indeed the design was not trying to avoid “re-writing to disk”; however the
assumption was that this would be a rare situation, and may only apply to a
part of the data.
If the initial number of partitions chosen N is large enough, then spilling
would happen only once per (some of the) data.
For HAG, the partitions should all be of similar sizes, so if N was chosen too
small, they all would need a “secondary spilling” which would include
re-writing all the data (that’s the analogous to “phase 3” ….)
HHJ is a little different (due to possible key duplicate rows) — some
partitions may be bigger than others. So maybe only few HHJ partitions would be
written twice.
The downside of the “sub-partition” model described in Goetz’ paper is that
indeed a second write of all the data is saved, but all the data (mixed
together) needs to be read (up to) n times (plus some CPU overhead to filter
out the needed parts).
Read may be costly, due to seeks (e.g., when a partition was spilled in many
iterations, and ended scattered across the disk.)
For the (future) HHJ design — we could handle a “too big” partition using a
“hash loop”; that is, read only a part of the inner partition, then scan the
whole outer matching partition, then read the next inner part, and scan the
whole outer partition again.
This is indeed costly, but not much different from the “sub-partitioning”
scheme — we just read part of the inner based on size (instead of several
parts, like 0.0, 0.1, and 0.2), and then the whole outer without filtering -
the hash table probing would do that.
Thanks for the suggestions and the link; I’ll go over Goetz’ paper again
and look for more ideas.
— Boaz
> On Jan 16, 2017, at 4:09 PM, Julian Hyde <[email protected]> wrote:
>
> Does the data need to be written into a disk-friendly format when a partition
> is selected to be written to disk? If you are careful in your choice of
> format then it doesn’t need to be re-written. And in fact you can start with
> the assumption that everything is going to disk.
>
> One of the most beautiful innovations of the HHJ paper [1] is the
> histogram-driven partitioning. Basically, during phase 1 you apply the phase
> 2 hash function to assign rows to “sub-partitions”. Partition 0 would contain
> sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain sub-partitions 1.1,
> …, 1.n. The rows are all mixed together in each partition, but you know how
> many rows (and bytes) are in each sub-partition. If partition 0 (or any
> partition) ends up larger than memory then you are going to need a phase 3.
> But you can enter phase 2 armed with some very useful knowledge. You know the
> sizes of the sub-partitions and you can choose a hash function in phase 2
> such that many of the partitions end up *just* smaller than memory.
>
> The big problem with external sort and hash algorithms is the huge
> performance hit when you require an extra phase. If you need 2 phases, HHJ
> can convert that to 1.5 phases (by pulling smaller partitions back into
> memory) and by optimizing the assignment of rows to partitions it can turn a
> 3.1 phase query into a 2.9 phase query - a big win.
>
> Julian
>
> [1]
> https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf
>
> <https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf>
>
>
>
>> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <[email protected]> wrote:
>>
>> Sorry for no attachment (Apache mail rules) -- Here is a link to the
>> document:
>>
>>
>> DrillSpillmemoryforHashAggregation.pdf -
>> https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing
>>
>> [https://lh3.googleusercontent.com/U9FNbWEBljT-HDRBE1-vhMnE4Ug5YFgutztvbys2UnTiVp-FQX6mzQ=w1200-h630-p]<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing>
>>
>> DrillSpillmemoryforHashAggregation.pdf<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing>
>> drive.google.com
>>
>>
>>
>> -- Boaz
>>
>> ________________________________
>> From: Julian Hyde <[email protected]>
>> Sent: Friday, January 13, 2017 11:00 PM
>> To: [email protected]
>> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator
>>
>> The attachment didn't come through. I'm hoping that you settled on a
>> "hybrid" hash algorithm that can write to disk, or write to memory, and the
>> cost of discovering that is wrong is not too great. With Goetz Graefe's
>> hybrid hash join (which can be easily adapted to hybrid hash aggregate) if
>> the input ALMOST fits in memory you could process most of it in memory, then
>> revisit the stuff you spilled to disk.
>>
>>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <[email protected]> wrote:
>>>
>>> Hi Drill developers,
>>>
>>> Attached is a document describing the design for memory spilling
>>> implementation for the Hash Aggregate operator.
>>>
>>> Please send me any comments or questions,
>>>
>>> -- Boaz
>>
>