Re: [HACKERS] Less than ideal error reporting in pg_stat_statements

Peter Geoghegan Sun, 04 Oct 2015 16:13:52 -0700

On Sun, Oct 4, 2015 at 3:10 PM, Tom Lane <[email protected]> wrote:
>> That seems perfectly reasonable, yes. Should I leave that to you?
>
> After a closer look I decided that wasn't reasonable at all.  Discounting
> sticky texts would then mean that after a GC cycle, we might still think
> the query texts file is bloated and issue another GC request, which of
> course would not shrink the file, so that we'd be doing GC cycles every
> time we added a new query.


That seems unlikely to me. Firstly, there is a generic check of 512
bytes per entry within need_gc_qtexts() -- we never GC if that generic
limit is not exceeded, and that's far from tiny. Secondly, how long do
you think those sticky entries will stay around in the hastable to
continue to cause trouble, and dominate over regular entries? There'd
have to be constant evictions/cache pressure for this situation to
occur, because the threshold within need_gc_qtexts() is based on
pg_stat_statements.max, and yet many evictions aggressively evict
sticky entries very quickly. Seems like a very unusual workload to me.

I think that you're overestimating the cost of discounting sticky
entries, which are usually very much in the minority (at least when it
matters, with cache pressure), and underestimating the cost of
continuing to weigh sticky entries' contribution to mean query text.

As I said, my proposal to not have sticky entries contribute to
overall mean query text length is based on a problem report involving
a continually failing data integration process. That in particular is
what I hope to stop having a negative impact on mean query length -- a
unique queryid makes for an entry that is bound to remain useless
forever (as opposed to just failing the first few times). With the
larger limit of MaxAllocHugeSize, I worry about swapping with the
exclusive lock held.

The fact that there is a big disparity between mean query length for
sticky and non-sticky entries is weird. It was seen to happen in the
wild only because the sticky entries that clogged things up were not
really distinct to a human -- only to the fingerprinting/jumbling
code, more or less by accident, which in a practical sense caused a
distortion.  That's what I'm targeting. I have a hard time imagining
any harm from discounting sticky entries with a realistic case.

> The mean query len recorded by gc_qtexts()
> *has to* match the mean length of what it actually put in the file, not
> the mean length of what we might wish it would put in the file.

Why? Why not simply care about whether or not the file was
unreasonably large relative to available, useful query statistics? I
think that mean_query_len isn't something that swings all that much
with realistic workloads and 5,000 representative entries -- it
certainly should not swing wildly. The problem to an extent was that
that accidentally stopped being true.

> By the same token, I'm back to believing that it's fundamentally bogus for
> entry_dealloc() to compute mean_query_len that way.  The most likely
> result of that is useless GC cycles.  The only thing that will actually
> free memory when you've got a lot of dead sticky entries is getting rid of
> the sticky hashtable entries, and the only way to do that is to wait for
> entry_dealloc() to get rid of 'em.  You were unenthused about making that
> removal more aggressive, which is fine, but you can't have it both ways.

Meanwhile, mean_query_len grows as more "distinct" entries are
created, pushing out their garbage collection further and further.
Especially when there is a big flood of odd queries that stay sticky.
Once again, I'm having a really hard time imagining a minority of
current, non-sticky hashtable entries dominating a majority of
current, sticky hashtable entries come garbage collection time.
Garbage collection is linked to creation of entries, which is also
linked to eviction, which aggressively evicts sticky entries in this
thrashing scenario that you describe.

> It does strike me that when we do get rid of the sticky entries, cleanup
> of the texts file might lag a bit longer than it needs to because
> mean_query_len is computed before not after deleting whatever entries
> we're going to delete.  On average, that shouldn't matter ... but if we
> are tossing a bunch of dead sticky entries, maybe they would have a higher
> mean length than the rest?

Yes, that is something that was observed to happen in the problem case
I looked into. You know my solution already.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Less than ideal error reporting in pg_stat_statements

Reply via email to