Re: Default setting for enable_hashagg_disk

Tomas Vondra Fri, 24 Jul 2020 01:41:26 -0700

On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote:

On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:

So let me share some fresh I/O statistics collected on the current code
using iosnoop. I've done the tests on two different machines using the
"aggregate part" of TPC-H Q17, i.e. essentially this:


     SELECT * FROM (
        SELECT
            l_partkey AS agg_partkey,
            0.2 * avg(l_quantity) AS avg_quantity
        FROM lineitem GROUP BY l_partkey OFFSET 1000000000
     ) part_agg;

The OFFSET is there just to ensure we don't need to send anything to
the client, etc.


Thanks for testing this.

So sort writes ~3.4GB of data, give or take. But hashagg/master writes
almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
much more than the 3.4GB of data written by sort (which has to spill
everything, while hashagg only spills rows not covered by the groups
that fit into work_mem).


What I find when I run your query (with my own TPC-H DB that is
smaller than what you used here -- 59,986,052 lineitem tuples) is that
the sort required about 7x more memory than the hash agg to do
everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak
hash agg memory usage. I'd be surprised if the ratio was very
different for you -- but can you check?


I can check, but it's not quite clear to me what are we looking for?
Increase work_mem until there's no need to spill in either case?

I think that there is something pathological about this spill
behavior, because it sounds like the precise opposite of what you
might expect when you make a rough extrapolation of what disk I/O will
be based on the memory used in no-spill cases (as reported by EXPLAIN
ANALYZE).


Maybe, not sure what exactly you think is pathological? The trouble is
hashagg has to spill input tuples but the memory used in no-spill case
represents aggregated groups, so I'm not sure how you could extrapolate
from that ...

FWIW one more suspicious thing that I forgot to mention is the behavior
of the "planned partitions" depending on work_mem, which looks like
this:

      2MB       Planned Partitions:  64    HashAgg Batches:  4160
      4MB       Planned Partitions: 128    HashAgg Batches: 16512
      8MB       Planned Partitions: 256    HashAgg Batches: 21488
     64MB       Planned Partitions:  32    HashAgg Batches:  2720
    256MB       Planned Partitions:   8    HashAgg Batches:     8

I'd expect the number of planned partitions to decrease (slowly) as
work_mem increases, but it seems to increase initially. Seems a bit
strange, but maybe it's expected.

What I find really surprising is the costing - despite writing about
twice as much data, the hashagg cost is estimated to be much lower than
the sort. For example on the i5 machine, the hashagg cost is ~10M, while
sort cost is almost 42M. Despite using almost twice as much disk. And
the costing is exactly the same for master and the CP_SMALL_TLIST.


That does make it sound like the costs of the hash agg aren't being
represented. I suppose it isn't clear if this is a costing issue
because it isn't clear if the execution time performance itself is
pathological or is instead something that must be accepted as the cost
of spilling the hash agg in a general kind of way.


Not sure, but I think we need to spill roughly as much as sort, so it
seems a bit strange that (a) we're spilling 2x as much data and yet the
cost is so much lower.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Default setting for enable_hashagg_disk

Reply via email to