On Tue, Mar 16, 2021 at 12:55:45PM +0100, Magnus Hagander wrote:
> On Tue, Mar 9, 2021 at 3:39 AM Julien Rouhaud <rjuju...@gmail.com> wrote:
> >
> > I think that we might be able to handle that without a flag.  The only thing
> > that would need to be done is, when creating an entry, to look for an
> > existing entry with the opposite flag, and if there is one, simply use the
> > same (query_offset, query_len) info.  This doesn't sound that expensive.
> 
> That's basically what I was trying to say :)

Oh ok sorry :)
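
For the record, the lookup part should be about as cheap as it gets.  Something
like this untested sketch is what I have in mind (find_twin_entry() is a name I
just made up, and it assumes pgss->lock is already held, as in pgss_store()):

/*
 * Untested sketch: look for the same query with the opposite toplevel flag.
 * If it exists, its (query_offset, query_len) can be reused so the normalized
 * text isn't appended to the external qtext file a second time.
 */
static pgssEntry *
find_twin_entry(pgssHashKey *key)
{
    pgssHashKey twin_key = *key;

    /* Same (userid, dbid, queryid), opposite toplevel flag. */
    twin_key.toplevel = !key->toplevel;

    return (pgssEntry *) hash_search(pgss_hash, &twin_key, HASH_FIND, NULL);
}

/* ... then, before storing the text for a new entry: */
    pgssEntry  *twin = find_twin_entry(&key);

    if (twin != NULL)
    {
        /* The normalized text is already in the file: just share it. */
        query_offset = twin->query_offset;
        query_len = twin->query_len;
    }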

> > The real pain point will be that the garbage collection phase will become
> > way more expensive, as it will now have to somehow maintain that knowledge,
> > which will require additional lookups for each entry.  I'm a bit concerned
> > about that, especially with the current heuristic to schedule garbage
> > collection.  For now, need_gc_qtexts says that we have to do it if the
> > extent is more than 512 (B) * pgss_max.  This probably doesn't work well
> > for people using ORMs, as they tend to generate gigantic SQL queries.
> 
> Right, the cost would be mostly on the GC  side. I've never done any
> profiling to see how big of a thing that is in systems today -- have
> you?

I didn't, but I don't see how it could be anything but hugely disruptive: it
basically prevents any query from being planned or executed on the whole
instance for the time needed to read the previous qtext file and write back
all the entries that are still needed.
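
To make the GC side a bit more concrete, here is roughly the extra bookkeeping
gc_qtexts() would need so that entries sharing a text still share it after the
rewrite (untested sketch, OffsetMapEntry and offset_map are made-up names):

/*
 * Remember which new offset was assigned to each old offset, so that
 * duplicated entries can be repointed at a single copy of the text.
 */
typedef struct OffsetMapEntry
{
    Size    old_offset;     /* key: offset in the old qtext file */
    Size    new_offset;     /* offset assigned in the rewritten file */
} OffsetMapEntry;

    HASHCTL     ctl;
    HTAB       *offset_map;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Size);
    ctl.entrysize = sizeof(OffsetMapEntry);
    offset_map = hash_create("qtext gc offset map", pgss_max, &ctl,
                             HASH_ELEM | HASH_BLOBS);

    hash_seq_init(&hash_seq, pgss_hash);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        bool            found;
        OffsetMapEntry *map;

        map = hash_search(offset_map, &entry->query_offset,
                          HASH_ENTER, &found);
        if (found)
        {
            /* Text already copied for the twin entry: just repoint it. */
            entry->query_offset = map->new_offset;
            continue;
        }

        /* ... fetch the text and append it to the new file, as today ... */
        map->new_offset = extent;
        entry->query_offset = extent;
        extent += entry->query_len + 1;
    }

    hash_destroy(offset_map);

That's an extra hash lookup per entry during GC, on top of a phase that's
already quite disruptive.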

> > I don't think that anyone really had a strong argument, mostly gut
> > feeling.  Note that pg_stat_kcache already implemented that toplevel flag,
> > so if people are using that extension in a recent version they might have
> > some figures to show.  I'll ping some people that I know are using it.
> 
> Great -- data always wins over gut feelings :)

So I asked some friends who have the latest pg_stat_kcache installed on a
preproduction environment configured to track nested queries.  The throughput
isn't high, but the activity should still be representative of the production
queries.  A lot of applications are plugged into it, spread over around 20
databases with quite a lot of PL code.

After a few days, here are the statistics:

- total of ~ 9500 entries
- ~ 900 entries for nested statements
- ~ 35 entries existing for both top level and nested statements

So the duplicates account for less than 4% of the nested statements, and less
than 0.5% of all entries.

I wish I had more reports, but if this one is representative enough, then it
seems that trying to avoid storing duplicated query texts wouldn't be worth it.

> > One good argument would be that gigantic queries generated by ORMs should
> > always be executed as top level statements.
> 
> Yes, that's true. And it probably holds as a more generic case as
> well, that is the queries that are likely to show up both top-level
> and lower-level are more likely to be relatively simple ones. (Except
> for example during the development of functions/procs where they're
> often executed top level as well to test etc, but that's not the most
> important case to optimize for)

Agreed.

> > I previously tried with the postgres regression tests, which clearly aren't
> > a representative workload, and as far as I can see the vast majority of
> > queries executed both as top level and nested level are DDL implying
> > recursion (e.g. a CREATE TABLE with underlying index creation).
> 
> I think the PostgreSQL regression tests are so far from a real world
> workload that the input in this case has a value of exactly zero.

I totally agree, but that's the only one I had at that time :)  Still, it
wasn't entirely useless, as I hadn't realized before that some DDL would lead
to duplicated entries.

