Does the data need to be written into a disk-friendly format when a partition is selected to be written to disk? If you are careful in your choice of format then it doesn’t need to be re-written. And in fact you can start with the assumption that everything is going to disk.
One of the most beautiful innovations of the HHJ paper [1] is the histogram-driven partitioning. Basically, during phase 1 you apply the phase 2 hash function to assign rows to “sub-partitions”. Partition 0 would contain sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain sub-partitions 1.1, …, 1.n. The rows are all mixed together in each partition, but you know how many rows (and bytes) are in each sub-partition. If partition 0 (or any partition) ends up larger than memory then you are going to need a phase 3. But you can enter phase 2 armed with some very useful knowledge. You know the sizes of the sub-partitions and you can choose a hash function in phase 2 such that many of the partitions end up *just* smaller than memory. The big problem with external sort and hash algorithms is the huge performance hit when you require an extra phase. If you need 2 phases, HHJ can convert that to 1.5 phases (by pulling smaller partitions back into memory) and by optimizing the assignment of rows to partitions it can turn a 3.1 phase query into a 2.9 phase query - a big win. Julian [1] https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf <https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf> > On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <[email protected]> wrote: > > Sorry for no attachment (Apache mail rules) -- Here is a link to the > document: > > > DrillSpillmemoryforHashAggregation.pdf - > https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing > > [https://lh3.googleusercontent.com/U9FNbWEBljT-HDRBE1-vhMnE4Ug5YFgutztvbys2UnTiVp-FQX6mzQ=w1200-h630-p]<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing> > > DrillSpillmemoryforHashAggregation.pdf<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing> > drive.google.com > > > > -- Boaz > > ________________________________ > From: Julian Hyde <[email protected]> > Sent: Friday, January 13, 2017 11:00 PM > To: [email protected] > Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator > > The attachment didn't come through. I'm hoping that you settled on a "hybrid" > hash algorithm that can write to disk, or write to memory, and the cost of > discovering that is wrong is not too great. With Goetz Graefe's hybrid hash > join (which can be easily adapted to hybrid hash aggregate) if the input > ALMOST fits in memory you could process most of it in memory, then revisit > the stuff you spilled to disk. > >> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <[email protected]> wrote: >> >> Hi Drill developers, >> >> Attached is a document describing the design for memory spilling >> implementation for the Hash Aggregate operator. >> >> Please send me any comments or questions, >> >> -- Boaz >
