Jan Urbański wrote:
If you think the Lossy Counting method has potential, I could test it somehow. Using my current work I could extract a stream of lexemes as ANALYZE sees it and run it through a python implementation of the algorithm to see if the result makes sense.

I hacked together a simplistic Python implementation and ran it on a table with 244901 tsvectors (45624891 lexemes total), comparing the results from my current approach with those from the Lossy Counting algorithm. I experimented with statistics_target set to 10 and 100, and ran pruning in the LC algorithm every 3, 10 or 100 tsvectors. With statistics_target set to 100 the sample size was 30000 rows, and that's what the input to the script was: the lexemes from those 30000 tsvectors.

With pruning every 10 tsvectors I got precisely the same results as with the original algorithm (same most common lexemes, same frequencies). With pruning every 100 tsvectors the results changed very slightly: they drifted a tiny bit from the original algorithm's and, I think, became a tiny bit more precise, but I didn't look into that closely.
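For anyone curious, the core of the method looks roughly like this. This is a simplified sketch of Lossy Counting as described by Manku and Motwani, not the exact script I ran; the function name and the choice of pruning at fixed bucket boundaries (rather than every N tsvectors) are just for illustration:

```python
from math import ceil

def lossy_count(stream, epsilon=0.001):
    """Approximate frequency counts over a stream of items.

    Guarantees that any item occurring more than epsilon * N times
    survives, where N is the stream length seen so far.
    """
    w = ceil(1.0 / epsilon)      # bucket width
    counts = {}                  # item -> (count, max possible undercount)
    b_current = 1                # current bucket id
    n = 0                        # items seen so far
    for item in stream:
        n += 1
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            # a new entry may have been pruned before, so its true count
            # could exceed the tracked one by up to b_current - 1
            counts[item] = (1, b_current - 1)
        if n % w == 0:           # bucket boundary: prune unpromising entries
            for key in [k for k, (f, d) in counts.items()
                        if f + d <= b_current]:
                del counts[key]
            b_current += 1
    return {k: f for k, (f, d) in counts.items()}
```

The pruning cadence is the knob I was varying in the tests above; pruning more often keeps the hash table smaller at the cost of slightly lossier counts.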

Bottom line seems to be: the Lossy Counting algorithm gives roughly the same results as the algorithm currently used, and is possibly faster (and more scalable with respect to statistics_target).

This should probably get more testing than running one script five times over a fixed dataset, but I already had trouble extracting ~300 MB of tsvectors from one of my production sites, getting it onto my laptop and so on. Do you think it's worthwhile to implement the LC algorithm in C and send it out, so others could try it? Heck, maybe it's even worth replacing the current compute_minimal_stats() algorithm with LC and seeing how that compares?

Anyway, I can share the Python script if someone would like to run more tests (though I suppose no one will, since you'd first need to apply my ts_typanalyze patch and then modify it further to extract lexemes from the sample).

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers