On 21.12.2010 16:54, Florian Pflug wrote:
> I think that maybe it'd be acceptable to scan a large portion of the
> table to estimate dist(A,B), *if* we must do so only once in a while.
> But even with a full scan, how would we arrive at an estimate for
> dist(A,B)? Keeping all values in memory, and spilling them into, say,
> an on-disk hash table adds even more overhead to the already expensive
> full scan. Maybe using a Bloom filter instead of a hash table could
> avoid the spilling to disk, in exchange for a slightly less precise
> result...

I've read up on the basics of Bloom filters, and I'm not sure we can use
them in this case. I see two problems with this approach:

1) inability to predict the number of elements in advance

   To build a Bloom filter with a bounded error rate, you need to size
   it properly (number of hash functions and size of the bit array).
   But that requires knowing the number of distinct elements, which is
   exactly the information we're looking for.

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to Wikipedia you need about 10 bits per
   element for a 1% error rate, i.e. about 1.2 MB of memory per million
   elements.

   That's not much, but it needs to be done for each column separately
   and for the combination - for N columns you need (at least) N+1
   filters.

   Sure - increasing the error rate and using a different estimate to
   size the Bloom filter would significantly decrease the memory
   requirements (a rough sizing sketch follows after this list).

2) sampling the whole table

   A naive approach would be to sample the whole table each time, but
   for large tables that might be a bit of a problem, so I'm thinking
   about what alternatives there are ...

   One possibility is to build the Bloom filter incrementally and store
   it on disk, perhaps updating it during VACUUM. Anyway, there's still
   the issue of how to set the filter size initially (estimating the
   number of elements), just as in point 1. (A sketch of how such a
   filter could be turned into an ndistinct estimate follows after this
   list.)

   Another possibility is to collect the data from just a small portion
   of the table and then use the result to estimate the number of
   distinct values for the whole table. But I'm not sure we can do this
   reliably; I see many traps in this approach.
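
Just to illustrate the numbers from (1) - this is only a
back-of-the-envelope sketch in C, nothing that exists anywhere in the
code. For an expected number of distinct elements n and a target false
positive rate p, the usual sizing formulas are
m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions:

#include <math.h>
#include <stdio.h>

static void
bloom_size(double n, double p, double *m_bits, int *k_hashes)
{
    double  ln2 = log(2.0);

    /* m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes */
    *m_bits = ceil(-n * log(p) / (ln2 * ln2));
    *k_hashes = (int) round((*m_bits / n) * ln2);
}

int
main(void)
{
    double  m;
    int     k;

    /* one million distinct values, 1% false positive rate */
    bloom_size(1e6, 0.01, &m, &k);
    printf("bits = %.0f (%.2f MB), hash functions = %d\n",
           m, m / 8.0 / 1e6, k);

    return 0;
}

For p = 1% this gives about 9.6 bits per element and 7 hash functions,
i.e. roughly 1.2 MB per million distinct values.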
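
And a sketch of how an incrementally maintained filter could be turned
into an ndistinct estimate (again just an illustration, the function is
made up) - for a filter of m bits with k hash functions and t bits
currently set, the usual estimator is n ~ -(m / k) * ln(1 - t / m):

#include <math.h>

static double
bloom_estimate_ndistinct(double m_bits, int k_hashes, double t_set_bits)
{
    /* n_est = -(m / k) * ln(1 - t / m) */
    if (t_set_bits >= m_bits)
        return HUGE_VAL;    /* filter is saturated, estimate unusable */

    return -(m_bits / k_hashes) * log(1.0 - t_set_bits / m_bits);
}

The same estimator could be applied to each per-column filter and to
the filter built on the combination, giving dist(A), dist(B) and
dist(A,B).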

But maybe I'm just missing something important.

regards
Tomas
