On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote:
On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
So let me share some fresh I/O statistics collected on the current code
using iosnoop. I've done the tests on two different machines using the
"aggregate part" of TPC-H Q17, i.e. essentially this:
SELECT * FROM (
SELECT
l_partkey AS agg_partkey,
0.2 * avg(l_quantity) AS avg_quantity
FROM lineitem GROUP BY l_partkey OFFSET 1000000000
) part_agg;
The OFFSET is there just to ensure we don't need to send anything to
the client, etc.
Thanks for testing this.
So sort writes ~3.4GB of data, give or take. But hashagg/master writes
almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
much more than the 3.4GB of data written by sort (which has to spill
everything, while hashagg only spills rows not covered by the groups
that fit into work_mem).
What I find when I run your query (with my own TPC-H DB that is
smaller than what you used here -- 59,986,052 lineitem tuples) is that
the sort required about 7x more memory than the hash agg to do
everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak
hash agg memory usage. I'd be surprised if the ratio was very
different for you -- but can you check?
I can check, but it's not quite clear to me what are we looking for?
Increase work_mem until there's no need to spill in either case?
I think that there is something pathological about this spill
behavior, because it sounds like the precise opposite of what you
might expect when you make a rough extrapolation of what disk I/O will
be based on the memory used in no-spill cases (as reported by EXPLAIN
ANALYZE).
Maybe, not sure what exactly you think is pathological? The trouble is
hashagg has to spill input tuples but the memory used in no-spill case
represents aggregated groups, so I'm not sure how you could extrapolate
from that ...
FWIW one more suspicious thing that I forgot to mention is the behavior
of the "planned partitions" depending on work_mem, which looks like
this:
2MB Planned Partitions: 64 HashAgg Batches: 4160
4MB Planned Partitions: 128 HashAgg Batches: 16512
8MB Planned Partitions: 256 HashAgg Batches: 21488
64MB Planned Partitions: 32 HashAgg Batches: 2720
256MB Planned Partitions: 8 HashAgg Batches: 8
I'd expect the number of planned partitions to decrease (slowly) as
work_mem increases, but it seems to increase initially. Seems a bit
strange, but maybe it's expected.
What I find really surprising is the costing - despite writing about
twice as much data, the hashagg cost is estimated to be much lower than
the sort. For example on the i5 machine, the hashagg cost is ~10M, while
sort cost is almost 42M. Despite using almost twice as much disk. And
the costing is exactly the same for master and the CP_SMALL_TLIST.
That does make it sound like the costs of the hash agg aren't being
represented. I suppose it isn't clear if this is a costing issue
because it isn't clear if the execution time performance itself is
pathological or is instead something that must be accepted as the cost
of spilling the hash agg in a general kind of way.
Not sure, but I think we need to spill roughly as much as sort, so it
seems a bit strange that (a) we're spilling 2x as much data and yet the
cost is so much lower.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services