Does the data need to be written into a disk-friendly format when a partition 
is selected to be written to disk? If you are careful in your choice of format 
then it doesn’t need to be re-written. And in fact you can start with the 
assumption that everything is going to disk.

One of the most beautiful innovations of the HHJ paper [1] is the 
histogram-driven partitioning. Basically, during phase 1 you apply the phase 2 
hash function to assign rows to “sub-partitions”. Partition 0 would contain 
sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain sub-partitions 1.1, 
…, 1.n. The rows are all mixed together in each partition, but you know how 
many rows (and bytes) are in each sub-partition. If partition 0 (or any 
partition) ends up larger than memory then you are going to need a phase 3. But 
you can enter phase 2 armed with some very useful knowledge. You know the sizes 
of the sub-partitions and you can choose a hash function in phase 2 such that 
many of the partitions end up *just* smaller than memory.

The big problem with external sort and hash algorithms is the huge performance 
hit when you require an extra phase. If you need 2 phases, HHJ can convert that 
to 1.5 phases (by pulling smaller partitions back into memory) and by 
optimizing the assignment of rows to partitions it can turn a 3.1 phase query 
into a 2.9 phase query - a big win.

Julian

[1] 
https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf 
<https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf>



> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <[email protected]> wrote:
> 
>  Sorry for no attachment (Apache mail rules) -- Here is a link to the 
> document:
> 
> 
> DrillSpillmemoryforHashAggregation.pdf - 
> https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing
> 
> [https://lh3.googleusercontent.com/U9FNbWEBljT-HDRBE1-vhMnE4Ug5YFgutztvbys2UnTiVp-FQX6mzQ=w1200-h630-p]<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing>
> 
> DrillSpillmemoryforHashAggregation.pdf<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing>
> drive.google.com
> 
> 
> 
>    -- Boaz
> 
> ________________________________
> From: Julian Hyde <[email protected]>
> Sent: Friday, January 13, 2017 11:00 PM
> To: [email protected]
> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator
> 
> The attachment didn't come through. I'm hoping that you settled on a "hybrid" 
> hash algorithm that can write to disk, or write to memory, and the cost of 
> discovering that is wrong is not too great. With Goetz Graefe's hybrid hash 
> join (which can be easily adapted to hybrid hash aggregate) if the input 
> ALMOST fits in memory you could process most of it in memory, then revisit 
> the stuff you spilled to disk.
> 
>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <[email protected]> wrote:
>> 
>> Hi Drill developers,
>> 
>>     Attached is a document describing the design for memory spilling 
>> implementation for the Hash Aggregate operator.
>> 
>>     Please send me any comments or questions,
>> 
>>        -- Boaz
> 

Reply via email to