I agree that an aggregate tends to compress data more than a join. Joins do compress data somewhat — when they have a filtering effect — so for both hash-aggregation and hash-join the size estimate is just an upper bound.
I also agree that the hash aggregate will fit data into memory in either the first or second phase. But the “much larger than memory” case is still very important. Think of what you would do to make an almost-unique list of customer ids into a fully unique list. HHJ does not “mix up” sub-partitions, not in a bad way, anyway. By design HHJ uses as many output partitions as possible (available memory divided by block size, because each partition needs one block of buffer space to prepare the data to be written), so it would not be practical to have one output file per sub-partition. There are far more sub-partitions than could be efficiently written to disk, so HHJ merely collects stats for them. All of the data in a partition will be read back at once, so the “mixing up” is not harmful. HHJ’s I/O is extremely efficient; I don’t believe that it ever does random access or reads a byte more than once. And due to its “hybrid” nature, some partition blocks are never completed and written to disk. Anyway. I’ve said my piece. Histograms of sub-partition stats are not an essential part of this algorithm but I wanted to make sure that you were aware of them, because they are as elegant as bloom filters and b-trees. Julian > On Jan 16, 2017, at 6:55 PM, Boaz Ben-Zvi <[email protected]> wrote: > > The design in the document has an undocumented assumption — allow the memory > available for an operator to fluctuate during the operator’s lifetime. > For example, the Hash Agregate operator (HAG) may hit the memory limit at > 500MB, than later hit it again at 600MB, etc, and when reading a spilled > partition back from disk the limit may be 400MB, etc. > (A more gentle scheme may allow memory to increase, never go down; e.g., > other operators may "give back” some of their allocations during execution; > e.g. HHJ after finishing the build phase — can yield the extra memory to > another operator). > This is a more sophisticated memory management design than what we have now — > a simple fixed pre-allocation. > > A second point is Hash Aggregation — which is a little different from HHJ. > For example, spilling a partition X (in multiple spill iterations) ends up > with 500MB on disk, but only (a fixed) 400MB memory is available. > Does this mean that partition X would not fit into the memory ? No, as X may > contain many duplicate groups (groups were recreated after every spill), so > as we read and aggregate X, its size may “shrink” enough for the 400MB. > > Indeed the design was not trying to avoid “re-writing to disk”; however the > assumption was that this would be a rare situation, and may only apply to a > part of the data. > If the initial number of partitions chosen N is large enough, then spilling > would happen only once per (some of the) data. > > For HAG, the partitions should all be of similar sizes, so if N was chosen > too small, they all would need a “secondary spilling” which would include > re-writing all the data (that’s the analogous to “phase 3” ….) > HHJ is a little different (due to possible key duplicate rows) — some > partitions may be bigger than others. So maybe only few HHJ partitions would > be written twice. > > The downside of the “sub-partition” model described in Goetz’ paper is that > indeed a second write of all the data is saved, but all the data (mixed > together) needs to be read (up to) n times (plus some CPU overhead to filter > out the needed parts). > Read may be costly, due to seeks (e.g., when a partition was spilled in many > iterations, and ended scattered across the disk.) > For the (future) HHJ design — we could handle a “too big” partition using a > “hash loop”; that is, read only a part of the inner partition, then scan the > whole outer matching partition, then read the next inner part, and scan the > whole outer partition again. > This is indeed costly, but not much different from the “sub-partitioning” > scheme — we just read part of the inner based on size (instead of several > parts, like 0.0, 0.1, and 0.2), and then the whole outer without filtering - > the hash table probing would do that. > > Thanks for the suggestions and the link; I’ll go over Goetz’ paper again > and look for more ideas. > > — Boaz > > >> On Jan 16, 2017, at 4:09 PM, Julian Hyde <[email protected]> wrote: >> >> Does the data need to be written into a disk-friendly format when a >> partition is selected to be written to disk? If you are careful in your >> choice of format then it doesn’t need to be re-written. And in fact you can >> start with the assumption that everything is going to disk. >> >> One of the most beautiful innovations of the HHJ paper [1] is the >> histogram-driven partitioning. Basically, during phase 1 you apply the phase >> 2 hash function to assign rows to “sub-partitions”. Partition 0 would >> contain sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain >> sub-partitions 1.1, …, 1.n. The rows are all mixed together in each >> partition, but you know how many rows (and bytes) are in each sub-partition. >> If partition 0 (or any partition) ends up larger than memory then you are >> going to need a phase 3. But you can enter phase 2 armed with some very >> useful knowledge. You know the sizes of the sub-partitions and you can >> choose a hash function in phase 2 such that many of the partitions end up >> *just* smaller than memory. >> >> The big problem with external sort and hash algorithms is the huge >> performance hit when you require an extra phase. If you need 2 phases, HHJ >> can convert that to 1.5 phases (by pulling smaller partitions back into >> memory) and by optimizing the assignment of rows to partitions it can turn a >> 3.1 phase query into a 2.9 phase query - a big win. >> >> Julian >> >> [1] >> https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf >> >> <https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf> >> >> >> >>> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <[email protected]> wrote: >>> >>> Sorry for no attachment (Apache mail rules) -- Here is a link to the >>> document: >>> >>> >>> DrillSpillmemoryforHashAggregation.pdf - >>> https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing >>> >>> [https://lh3.googleusercontent.com/U9FNbWEBljT-HDRBE1-vhMnE4Ug5YFgutztvbys2UnTiVp-FQX6mzQ=w1200-h630-p]<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing> >>> >>> DrillSpillmemoryforHashAggregation.pdf<https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing> >>> drive.google.com >>> >>> >>> >>> -- Boaz >>> >>> ________________________________ >>> From: Julian Hyde <[email protected]> >>> Sent: Friday, January 13, 2017 11:00 PM >>> To: [email protected] >>> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator >>> >>> The attachment didn't come through. I'm hoping that you settled on a >>> "hybrid" hash algorithm that can write to disk, or write to memory, and the >>> cost of discovering that is wrong is not too great. With Goetz Graefe's >>> hybrid hash join (which can be easily adapted to hybrid hash aggregate) if >>> the input ALMOST fits in memory you could process most of it in memory, >>> then revisit the stuff you spilled to disk. >>> >>>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <[email protected]> wrote: >>>> >>>> Hi Drill developers, >>>> >>>> Attached is a document describing the design for memory spilling >>>> implementation for the Hash Aggregate operator. >>>> >>>> Please send me any comments or questions, >>>> >>>> -- Boaz >>> >> >
