Oleg Bartunov wrote:
> On Sat, 8 Mar 2008, Jan Urbanski wrote:
>> OK, after reading through some of the code, the idea is to write a custom typanalyze function for tsvector columns. It could look inside

> Such a function already exists: it's ts_stat(). The problem with ts_stat() is its performance, since it sequentially scans ALL tsvectors. It's possible to write a special function for the tsvector data type that would be used by ANALYZE, but I'm not sure sampling is a good approach here.

I too was concerned about whether looking through the tsvectors of 100 random documents would be enough to get satisfying results. But maybe it would? I haven't seen many databases in my life, but I have a feeling that in many cases the documents stored in them have a lot in common, such as using similar vocabulary. For example, a database containing the PostgreSQL documentation would have a lot of records with the word 'tuple' or 'column', and a sampling approach should catch at least these most common words.
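For instance, something along these lines (with a hypothetical table docs and tsvector column body_tsv) should already surface the most common lexemes from a rough 1% sample:

    -- sketch: most common lexemes from a ~1% random sample of documents
    SELECT word, ndoc, nentry
      FROM ts_stat('SELECT body_tsv FROM docs WHERE random() < 0.01')
     ORDER BY ndoc DESC
     LIMIT 100;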

By the way, does ALTER TABLE SET STATISTICS influence the number of tuples chosen by ANALYZE to do its work, or just the amount of statistics extracted from that sample? It wasn't obvious to me while reading the code.
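For reference, the knob I have in mind is the per-column statistics target (made-up names again):

    -- per-column statistics target for the tsvector column
    ALTER TABLE docs ALTER COLUMN body_tsv SET STATISTICS 1000;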

> The way we could improve the performance of gathering stats using ts_stat() is to process only new documents. It may not be as fast as it looks because of lots of updates, so one needs to think about it more.

That's the second approach I thought about. Suppose we let the user create a table with an INSERT ... SELECT * FROM ts_stat(...) and associate it somehow with the table containing the indexed documents. We could then give quite precise estimates for text search queries by looking up lexeme frequencies in that table. We could also give the users a trigger function, similar to tsvector_update_trigger, that could be installed for inserts, updates and deletes and would "merge" a document's changes into this statistics table. Or just cram that data into pg_statistic and make the trigger operate on that?
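A very rough sketch of what I mean (all names made up, only the INSERT case handled; UPDATE and DELETE would have to subtract the old tsvector analogously):

    -- one-off build of the frequency table
    CREATE TABLE docs_lexeme_stats AS
        SELECT word, ndoc, nentry
        FROM ts_stat('SELECT body_tsv FROM docs');

    -- fold a freshly inserted document's lexemes into the stats table
    CREATE FUNCTION docs_stats_merge() RETURNS trigger AS $$
    DECLARE
        s record;
    BEGIN
        FOR s IN
            SELECT word, ndoc, nentry
            FROM ts_stat('SELECT ' || quote_literal(NEW.body_tsv::text)
                         || '::tsvector')
        LOOP
            UPDATE docs_lexeme_stats
               SET ndoc = ndoc + s.ndoc, nentry = nentry + s.nentry
             WHERE word = s.word;
            IF NOT FOUND THEN
                INSERT INTO docs_lexeme_stats (word, ndoc, nentry)
                VALUES (s.word, s.ndoc, s.nentry);
            END IF;
        END LOOP;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER docs_stats_trg AFTER INSERT ON docs
        FOR EACH ROW EXECUTE PROCEDURE docs_stats_merge();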

This looks like a more complicated approach. It would surely give much better results, but I'm worried about the additional overhead on every non-read-only operation. Anyway, I'm not even sure if this approach is sane: if we're to do it for tsvectors (essentially keep an extra table with the computed frequencies of values from a column), then why not do it for integers?

> Unfortunately, selectivity estimation for a query is much more difficult than just estimating the frequency of an individual word.

Sure, given something like 'cats & dogs'::tsquery the frequencies of 'cat' and 'dog' alone won't suffice. But at least it's a starting point, and if we estimate that 80% of the documents have 'dog' and 70% have 'cat', then we can tell for sure that at least 50% have both (at most 20% lack 'dog' and at most 30% lack 'cat', so at most 50% can be missing one or the other), and that's a lot better than the 0.1% that's being returned now.
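With a frequency table like the one sketched above, even a naive estimate that just multiplies the per-lexeme document frequencies (i.e. assumes the lexemes are independent) would get us into the right ballpark:

    -- crude selectivity estimate for 'cat & dog', assuming lexeme independence
    -- (docs_lexeme_stats and docs are the made-up names from the earlier sketch)
    SELECT (SELECT ndoc FROM docs_lexeme_stats WHERE word = 'cat')::float8
           / (SELECT count(*) FROM docs)
         * (SELECT ndoc FROM docs_lexeme_stats WHERE word = 'dog')::float8
           / (SELECT count(*) FROM docs) AS estimated_selectivity;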

Regards,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
