On Wed, Sep 3, 2014 at 7:34 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
>> Well, I think you're certainly right that a hash table lookup is more
>> expensive than modulo on a 32-bit integer; so much is obvious. But a
>> hash join can increase the number of batches on the fly only by
>> doubling it, so you might go from 4 batches to 8 when 5 would really
>> have been enough. And a hash join also can't *reduce* the number of
>> batches on the fly, which might matter a lot. Getting the number of
>> batches right avoids I/O, which is a lot more expensive than CPU.
>
> Regarding the estimates, I don't see much difference between the two
> approaches when handling this issue.
>
> It's true you can wait with deciding how many partitions (aka batches)
> to create until work_mem is full, at which point you have more
> information than at the very beginning. You know how many tuples you've
> already seen, how many tuples you expect (which is, however, only an
> estimate), etc. And you may use that to estimate the number of
> partitions to create.
I think it's significantly better than that. The first point I'd make is
that if work_mem never fills up, you don't need to batch anything at all.
That's a potentially huge win over batching a join that we thought was
going to overrun work_mem but didn't. But even if work_mem does fill up,
I think we still come out ahead, because we don't necessarily need to
dump the *entirety* of each batch to disk.

For example, suppose there are 900 distinct values and only 300 of them
can fit in memory at a time. We read the input until work_mem is full and
we see a previously-unseen value, so we decide to split the input up into
4 batches. We now finish reading the input. Each previously-seen value
gets added to an existing in-memory group, and each new value gets
written into one of four disk files. At the end of the input, 300 groups
are complete, and we have four files on disk, each of which contains the
data for 150 of the remaining 600 groups.

Now, the alternative strategy is to batch from the beginning. Here, we
decide right from the get-go that we're using 4 batches, so batch #1 goes
into memory and the remaining 3 batches get written to three different
disk files. At the end of the input, 225 groups are complete, and we have
three files on disk, each of which contains the data for 225 of the
remaining 675 groups. This seems clearly inferior, because we have
written 675 groups to disk when it would have been possible to write only
600.

The gains can be even more significant when the input data is skewed. For
example, suppose things are as above, but ten values account for 90% of
all the inputs, and the remaining 890 values account for the other 10% of
the inputs. Furthermore, let's suppose we have no table statistics, or
they are totally wrong. In Jeff's approach, as long as each of those ten
values occurs at least once before work_mem fills up, they'll all be
processed in the initial pass through the data, which means we will write
at most 10% of the data to disk.
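The two strategies above can be compared with a small simulation. The numbers (900 distinct values, memory for 300 groups, 4 batches) come from the example; the modulo-based batch assignment, the synthetic uniform input stream, and the function names are illustrative assumptions, not anything from the actual patch.

```python
import random

def simulate(values, mem_groups, nbatches, batch_upfront):
    """Count how many groups finish in memory vs. spill to disk.

    batch_upfront=False: Jeff's approach -- fill work_mem with the first
    groups seen, then route previously-unseen values to disk files.
    batch_upfront=True: decide batching up front -- only values hashed to
    batch #1 stay in memory; the rest are written out immediately.
    """
    in_memory = set()
    spilled = set()
    for v in values:
        if batch_upfront:
            if v % nbatches == 0:                # batch #1 stays in memory
                in_memory.add(v)
            else:
                spilled.add(v)
        else:
            if v in in_memory or len(in_memory) < mem_groups:
                in_memory.add(v)                 # existing or still-fitting group
            else:
                spilled.add(v)                   # new group after memory filled
    return len(in_memory), len(spilled)

random.seed(42)
stream = [random.randrange(900) for _ in range(100_000)]

# Jeff's approach: 300 groups complete in memory, 600 groups spilled.
print(simulate(stream, 300, 4, batch_upfront=False))
# Batch-from-the-start: only 225 groups in memory, 675 groups spilled.
print(simulate(stream, 300, 4, batch_upfront=True))
```

With a long enough uniform stream, every one of the 900 values shows up, so the counts match the example: 600 groups written to disk under the late-batching strategy versus 675 under up-front batching.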
In fact, it will be a little bit less, because batch 1 will contain not
only the 10 frequently-occurring values but also 290 others, so our
initial pass through the data will complete 300 groups covering (if the
less-frequent values occur with uniform frequency) 93.258% of the data.
The remaining ~6.7% will be split up into 4 files which we can then
reread and process.

But if we use the other approach, we'll only get 2 or 3 of the 10
commonly-occurring values in the first batch, so we expect to write about
75% of the data out to one of our three batch files. That's a BIG
difference - more than 10x the I/O load that Jeff's approach would have
incurred. Now, admittedly, we could use a skew optimization similar to
the one we use for hash joins to try to get the MCVs into the first
batch, and that would help a lot when the statistics are right - but
sometimes the statistics are wrong, and Jeff's approach doesn't care. It
just keeps on working.

> That however comes at a cost - it's not really a memory-bounded hash
> aggregate, because you explicitly allow exceeding work_mem as more
> tuples for existing groups arrive.

Well, that would be true for now, but as has been mentioned, we can add
new methods to the aggregate infrastructure to serialize and deserialize
transition states. I guess I agree that, in the absence of such
infrastructure, your patch might be a better way to handle cases like
array_agg, but I'm pretty happy to see that infrastructure get added.

Hmm. It occurs to me that it could also be really good to add a "merge
transition states" operator to the aggregate infrastructure. That would
allow further improvements to Jeff's approach for cases like array_agg.
If we serialize a transition state to disk because it's not fitting in
memory, we don't need to reload it before continuing to process the
group, or at least not right away. We can instead just start a new
transition state and then merge all of the accumulated states at the end
of the aggregation.
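The "merge transition states" idea can be illustrated with avg(), whose transition state is a (sum, count) pair. PostgreSQL's real aggregate machinery is C and the operator being proposed here didn't exist yet, so this is only a toy sketch of the concept; the names avg_trans, avg_merge, and avg_final are hypothetical.

```python
def avg_trans(state, value):
    """Transition function: fold one input value into the (sum, count) state."""
    s, n = state
    return (s + value, n + 1)

def avg_merge(a, b):
    """Merge two partial states -- for avg, componentwise addition suffices."""
    return (a[0] + b[0], a[1] + b[1])

def avg_final(state):
    """Final function: turn the merged state into the aggregate result."""
    s, n = state
    return s / n

# A group's inputs arrive in two chunks. The first partial state gets
# "serialized to disk" when memory fills; rather than reloading it, we
# start a fresh state for the later tuples and merge at finalization.
chunk1, chunk2 = [1, 2, 3], [4, 5]

state1 = (0, 0)
for v in chunk1:
    state1 = avg_trans(state1, v)    # state1 == (6, 3), then spilled

state2 = (0, 0)
for v in chunk2:
    state2 = avg_trans(state2, v)    # state2 == (9, 2)

merged = avg_merge(state1, state2)   # reload the spilled state and merge
print(avg_final(merged))             # prints 3.0, the avg of 1..5
```

The point is that the spilled state never has to occupy work_mem while the rest of the input is being read; it is only brought back when it's time to finalize the group.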
That's good, because it means we're not using up precious work_mem for
transition state data that really isn't needed until it's time to start
finalizing groups. And it would be useful for parallelism eventually,
too. :-)

> Also, no one really says the initial estimate of how many tuples will be
> aggregated is correct. It's about as unreliable as the group count
> estimate. So how exactly are you going to estimate the partitions?
>
> Considering this, I doubt being able to choose arbitrary number of
> partitions (instead of only powers of 2) is really an advantage.

You're right. I was using the terminology in an imprecise and misleading
way. What I meant was more along the lines of what's in the first four
paragraphs of this email - namely, that with Jeff's approach, it seems
that you can be certain of using all the memory you have available on the
first pass through, whereas with your approach there seems to be a risk
of dumping data to disk that could have been kept in memory and
processed. Also, it's very likely that all of the frequently-occurring
values will get handled in the initial pass.

To put this another way - and I think we all agree on this - we should be
very concerned with minimizing the number of times the data gets
rewritten. If the data doesn't fit in memory, we're going to have to
rewrite at least some of it. But the algorithm we choose could cause us
to rewrite more of it than necessary, and that's bad.

> What I think we should prevent is under-estimating the number of
> batches, because in that case you have to read the whole batch, write
> part of it again and then read it again. Instead of just writing it once
> (into two files). Reading a tuple from a batch only to write it to
> another batch is not really efficient.

Completely agreed. Choosing a partition count that is higher than
necessary doesn't hurt much. The expensive part is spilling the tuples to
disk for processing in a future batch rather than processing them
immediately.
Once we've decided we're going to do that one way or the other, the cost
of distributing the tuples we decide to write among (say) 16 tapes vs. 4
tapes is probably relatively small. (At some point this breaks down; 1024
tapes will overflow the FD table.) But picking a partition count that is
too low could be extremely expensive, in that, as you say, we'd need to
rewrite the data a second time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers