Jan Urbański wrote:
If you think the Lossy Counting method has potential, I could test it somehow. Using my current work I could extract a stream of lexemes as ANALYZE sees it and run it through a python implementation of the algorithm to see if the result makes sense.

I hacked together a simplistic Python implementation and ran it on a table with 244901 tsvectors (45624891 lexemes total), comparing the results from my current approach with those from the Lossy Counting algorithm. I experimented with statistics_target set to 10 and 100, and ran pruning in the LC algorithm every 3, 10 or 100 tsvectors. With statistics_target set to 100 the sample size was 30000 rows, and that's what the input to the script was: the lexemes from those 30000 tsvectors.

With pruning every 10 tsvectors I got precisely the same results as with the original algorithm (same most common lexemes, same frequencies). With pruning every 100 tsvectors the results changed very slightly: they drifted a tiny bit from the original algorithm's and, I think, became a tiny bit more precise, but I didn't look into that closely.
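For anyone curious, the core of the method looks roughly like this. This is a simplified sketch of Lossy Counting as described by Manku and Motwani, not the exact script I ran; the function name and the choice of pruning at fixed bucket boundaries (rather than every N tsvectors) are just for illustration:

```python
from math import ceil

def lossy_count(stream, epsilon=0.001):
    """Approximate frequency counts over a stream of items.

    Guarantees that any item occurring more than epsilon * N times
    survives, where N is the stream length seen so far.
    """
    w = ceil(1.0 / epsilon)      # bucket width
    counts = {}                  # item -> (count, max possible undercount)
    b_current = 1                # current bucket id
    n = 0                        # items seen so far
    for item in stream:
        n += 1
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            # a new entry may have been pruned before, so its true count
            # could exceed the tracked one by up to b_current - 1
            counts[item] = (1, b_current - 1)
        if n % w == 0:           # bucket boundary: prune unpromising entries
            for key in [k for k, (f, d) in counts.items()
                        if f + d <= b_current]:
                del counts[key]
            b_current += 1
    return {k: f for k, (f, d) in counts.items()}
```

The pruning cadence is the knob I was varying in the tests above; pruning more often keeps the hash table smaller at the cost of slightly lossier counts.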

Bottom line seems to be: the Lossy Counting algorithm gives roughly the same results as the algorithm currently used, and is possibly faster (and more scalable with respect to statistics_target).

This should probably get more testing than running one script five times over a fixed dataset, but I already had trouble extracting ~300 MB of tsvectors from one of my production sites, getting it onto my laptop and so on. Do you think it's worthwhile to implement the LC algorithm in C and send it out, so others could try it? Heck, maybe it's even worth replacing the current compute_minimal_stats() algorithm with LC and seeing how that compares?

Anyway, I can share the Python script if someone would like to run more tests (though I suppose no one will, since you'd first need to apply my ts_typanalyze patch and then modify it further to extract lexemes from the sample).

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers