Re: [HACKERS] Hash id in pg_stat_statements

Magnus Hagander Thu, 15 Nov 2012 05:11:01 -0800

On Tue, Oct 2, 2012 at 8:22 PM, Peter Geoghegan <pe...@2ndquadrant.com> wrote:
> On 2 October 2012 18:16, Tom Lane <t...@sss.pgh.pa.us> wrote
>> 1. Why isn't something like md5() on the reported query text an equally
>> good solution for users who want a query hash?
>
> Because that does not uniquely identify the entry. The very first
> thing that the docs say on search_path is "Qualified names are tedious
> to write, and it's often best not to wire a particular schema name
> into applications anyway". Presumably, the reason it's best not to
> wire schema names into apps is because it might be useful to modify
> search_path in a way that dynamically made the same queries in some
> application reference what are technically distinct relations. If
> anyone does this, and it seems likely that many do for various
> reasons, they will be out of luck when using some kind of
> pg_stat_statements aggregation.
>
> This was the behaviour that I intended for pg_stat_statements all
> along, and I think it's better than a solution that matches query
> strings.
>
>> 2. If people are going to accumulate stats on queries over a long period
>> of time, is a 32-bit hash really good enough for the purpose?  If I'm
>> doing the math right, the chance of collision is already greater than 1%
>> at 10000 queries, and rises to about 70% for 100000 queries; see
>> http://en.wikipedia.org/wiki/Birthday_paradox
>> We discussed this issue and decided it was okay for pg_stat_statements's
>> internal hash table, but it's not at all clear to me that it's sensible
>> to use 32-bit hashes for external accumulation of query stats.
>
> Well, forgive me for pointing this out, but I did propose that the
> hash be a 64-bit value (which would have necessitated adopting
> hash_any() to produce 64-bit values), but you rejected the proposal. I
> arrived at the same probability for a collision as you did and posted
> in to the list, in discussion shortly after the normalisation stuff
> was committed.


What was the argument for rejecting it? Just that it required
hash_any() to be adapted?

Now that we have a very clear use case where this would help, perhaps
it's time to re-visit this proposal?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Hash id in pg_stat_statements

Reply via email to