On Wed, Sep 3, 2014 at 7:34 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
>> Well, I think you're certainly right that a hash table lookup is more
>> expensive than modulo on a 32-bit integer; so much is obvious.  But a
>> hash join can increase the number of batches on the fly, but only by
>> doubling it, so you might go from 4 batches to 8 when 5 would really
>> have been enough.  And a hash join also can't *reduce* the number of
>> batches on the fly, which might matter a lot.  Getting the number of
>> batches right avoids I/O, which is a lot more expensive than CPU.
>
> Regarding the estimates, I don't see much difference between the two
> approaches when handling this issue.
>
> It's true you can wait with deciding how many partitions (aka batches)
> to create until work_mem is full, at which point you have more
> information than at the very beginning. You know how many tuples you've
> already seen, how many tuples you expect (which is however only an
> estimate etc.). And you may use that to estimate the number of
> partitions to create.

I think it's significantly better than that.  The first point I'd make
is that if work_mem never fills up, you don't need to batch anything
at all.  That's a potentially huge win over batching a join we thought
was going to overrun work_mem, but didn't.

But even if work_mem does fill up, I think we still come out ahead,
because we don't necessarily need to dump the *entirety* of each batch
to disk.  For example, suppose there are 900 distinct values and only
300 of them can fit in memory at a time.  We read the input until
work_mem is full and we see a previously-unseen value, so we decide to
split the input up into 4 batches.  We now finish reading the input.
Each previously-seen value gets added to an existing in-memory group,
and each new value gets written into one of four disk files.  At
the end of the input, 300 groups are complete, and we have four files
on disk each of which contains the data for 150 of the remaining 600
groups.

Now, the alternative strategy is to batch from the beginning.  Here,
we decide right from the get-go that we're using 4 batches, so batch
#1 goes into memory and the remaining 3 batches get written to three
different disk files.  At the end of the input, 225 groups are
complete, and we have three files on disk each of which contains the
data for 225 of the remaining 675 groups.  This seems clearly
inferior, because we have written 675 groups to disk when it would
have been possible to write only 600.
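To make the arithmetic concrete, here's a small back-of-envelope model of the two strategies (the numbers and variable names are illustrative only, not PostgreSQL internals):

```python
# Hypothetical model of the two batching strategies discussed above,
# using the example numbers: 900 uniformly-distributed groups, room
# for 300 groups in work_mem, and a partition count of 4.

NGROUPS = 900        # distinct grouping values in the input
MEM_GROUPS = 300     # groups that fit in work_mem at once
NBATCH = 4           # partition count chosen when memory fills up

# Strategy A ("fill memory first"): the first MEM_GROUPS distinct
# values stay in memory; every group seen after that is spilled to
# one of NBATCH disk files.
spilled_a = NGROUPS - MEM_GROUPS               # 600 groups hit the disk
per_file_a = spilled_a / NBATCH                # 150 groups per file

# Strategy B ("partition up front"): groups are hashed into NBATCH
# batches immediately; only batch #1 is processed in memory, and the
# other three batches are spilled in their entirety.
in_memory_b = NGROUPS // NBATCH                # 225 groups complete in pass 1
spilled_b = NGROUPS - in_memory_b              # 675 groups hit the disk
per_file_b = spilled_b / (NBATCH - 1)          # 225 groups per file

print(f"A: {spilled_a} groups spilled ({per_file_a:.0f} per file)")
print(f"B: {spilled_b} groups spilled ({per_file_b:.0f} per file)")
```

The model reproduces the 600-vs-675 comparison above: strategy B spills 75 more groups than strategy A on exactly the same input.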

The gains can be even more significant when the input data is skewed.
For example, suppose things are as above, but ten values account for
90% of all the inputs, and the remaining 890 values account for the
other 10% of the inputs.  Furthermore, let's suppose we have no table
statistics or they are totally wrong.  In Jeff's approach, as long as
each of those values occurs at least once before work_mem fills up,
they'll all be processed in the initial pass through the data, which
means we will write at most 10% of the data to disk.  In fact it will
be a little bit less, because batch 1 will have not only the 10
frequently-occurring values but also 290 others, so our initial pass
through the data will complete 300 groups covering (if the
less-frequent values occur with uniform frequency) 93.258% of the
data.  The remaining ~6.7% will be split up into 4 files which we can
then reread and process.  But if we use the other approach, we'll only
get 2 or 3 of the 10 commonly-occurring values in the first batch, so
we expect to write about 75% of the data out to one of our three batch
files.  That's a BIG difference - more than 10x the I/O load that
Jeff's approach would have incurred.  Now, admittedly, we could use a
skew optimization similar to the one we use for hash joins to try to
get the MCVs into the first batch, and that would help a lot when the
statistics are right - but sometimes the statistics are wrong, and
Jeff's approach doesn't care.  It just keeps on working.
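The arithmetic behind the 93.258% figure can be checked directly; here's a sketch using the numbers from this example:

```python
# Checking the skewed-data arithmetic: 10 common values carry 90% of
# the rows, 890 rare values carry the other 10%, and 300 groups fit
# in work_mem.  (Illustrative numbers from the example, nothing more.)

mcv_share = 0.90        # fraction of rows held by the 10 common values
rare_share = 0.10       # fraction of rows held by the 890 rare values
rare_groups = 890
mem_groups = 300

# Jeff's approach: if every common value shows up at least once before
# work_mem fills, the first pass holds all 10 of them plus 290 rare
# groups, and completes that fraction of the data in memory.
rare_in_mem = mem_groups - 10
pass1 = mcv_share + rare_share * rare_in_mem / rare_groups
print(f"first pass completes {pass1:.3%} of the data")   # ~93.258%
print(f"spilled to disk:     {1 - pass1:.3%}")           # ~6.742%

# Up-front partitioning into 4 batches: on average only 1/4 of the
# rows (and only ~2.5 of the 10 common values) land in the in-memory
# batch, so roughly 75% of the data gets written out.
print(f"up-front spill:      {1 - 0.25:.0%}")            # 75%
```

That 75%-vs-6.7% gap is the "more than 10x the I/O" claim above.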

> That however comes at a cost - it's not really a memory-bounded hash
> aggregate, because you explicitly allow exceeding work_mem as more
> tuples for existing groups arrive.

Well, that would be true for now, but as has been mentioned, we can
add new methods to the aggregate infrastructure to serialize and
de-serialize transition states.  I guess I agree that, in the absence
of such infrastructure, your patch might be a better way to handle
cases like array_agg, but I'm pretty happy to see that infrastructure
get added.

Hmm.  It occurs to me that it could also be really good to add a
"merge transition states" operator to the aggregate infrastructure.
That would allow further improvements to Jeff's approach for cases
like array_agg.  If we serialize a transition state to disk because
it's not fitting in memory, we don't need to reload it before
continuing to process the group, or at least not right away.  We can
instead just start a new transition state and then merge all of the
accumulated states at the end of the aggregation.  That's good, because
it means we're not using up precious work_mem for transition state
data that really isn't needed until it's time to start finalizing
groups.  And it would be useful for parallelism eventually, too.  :-)
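For a simple aggregate like avg(), a merge operator is cheap: the transition state is a (sum, count) pair, and two partially-accumulated states combine without replaying any input rows. A standalone sketch (the transfn/mergefn/finalfn names just echo the aggregate-infrastructure terminology; this is not PostgreSQL code):

```python
# Illustrative transition-state merge for an avg()-style aggregate.
# The state is a (sum, count) pair; merging is elementwise addition.

def transfn(state, value):
    # advance the state by one input value
    s, n = state
    return (s + value, n + 1)

def mergefn(a, b):
    # combine two independently accumulated states
    return (a[0] + b[0], a[1] + b[1])

def finalfn(state):
    # produce the aggregate result from a state
    s, n = state
    return s / n if n else None

# Accumulate the same group in two passes -- e.g. before and after the
# first state was serialized to disk -- then merge at the very end.
first = (0, 0)
for v in (10, 20, 30):
    first = transfn(first, v)

second = (0, 0)
for v in (40, 50):
    second = transfn(second, v)

print(finalfn(mergefn(first, second)))  # 30.0 == avg(10, 20, 30, 40, 50)
```

Not every aggregate has such a natural merge (ordered-set aggregates, for instance, are harder), which is presumably why it would be a new, optional piece of the infrastructure.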

> Also, no one really says the initial estimate of how many tuples will be
> aggregated is correct. It's about as unreliable as the group count
> estimate. So how exactly are you going to estimate the partitions?
>
> Considering this, I doubt being able to choose arbitrary number of
> partitions (instead of only powers of 2) is really an advantage.

You're right.  I was using the terminology in an imprecise and
misleading way.  What I meant was more along the lines of what's in
the first four paragraphs of this email - namely, that with Jeff's
approach, it seems that you can be certain of using all the memory you
have available on the first pass through, whereas with your approach
there seems to be a risk of dumping data to disk that could have been
kept in memory and processed.  Also, it's very likely that all of the
frequently-occurring values will get handled in the initial pass.

To put this another way, and I think we all agree on this, I think we
should be very concerned with minimizing the number of times the data
gets rewritten.  If the data doesn't fit in memory, we're going to
have to rewrite at least some of it.  But the algorithm we choose
could cause us to rewrite more of it than necessary, and that's bad.

> What I think we should prevent is under-estimating the number of batches,
> because in that case you have to read the whole batch, write part of it
> again and then read it again. Instead of just writing it once (into two
> files). Reading a tuple from a batch only to write it to another batch
> is not really efficient.

Completely agreed.  Choosing a partition count that is higher than
necessary doesn't hurt much.  The expensive part is spilling the
tuples to disk for processing in a future batch rather than processing
them immediately.  Once we've decided we're going to do that one way
or the other, the cost of distributing the tuples we decide to write
among (say) 16 tapes vs. 4 tapes is probably relatively small.  (At
some point this breaks down; 1024 tapes will overflow the FD table.)
But picking a partition count that is too low could be extremely
expensive, in that, as you say, we'd need to rewrite the data a second
time.
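A rough I/O model shows why the asymmetry matters. Under the doubling scheme quoted at the top of this email, undershooting the batch count forces roughly half of every existing batch file to be read back and rewritten; overshooting just spreads the same single write across more files. (Illustrative model only; the tuple count and doubling rule are assumptions, not measurements.)

```python
# Rough I/O model: every spilled tuple is written once if the initial
# partition count is high enough, but if the count has to double
# mid-run, about half of the already-spilled tuples must be read back
# and rewritten into the new batch files.

spilled = 600_000          # tuples that don't fit in memory (made up)
init_batches = 4
needed = 5                 # true number of batches required

writes = spilled           # every spilled tuple is written at least once
nbatch = init_batches
while nbatch < needed:
    nbatch *= 2            # batches can only double (4 -> 8, not 4 -> 5)
    writes += spilled // 2 # ~half of each batch moves to a new file

print(f"ended with {nbatch} batches and {writes} tuple writes")
```

Starting with 16 batches instead would have cost the same 600,000 writes, just spread across more (cheap) files; starting with 4 costs an extra 300,000 writes plus the reads to feed them.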

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers