On 2011-01-24 16:34, Alexander Korotkov wrote:
On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański <wulc...@wulczer.org> wrote:
I see two issues with this patch. The first is the resulting index
size. I created a table with 5 copies of
/usr/share/dict/american-english in it and a gin index on it, using
gin_trgm_ops. The results were:
* relation size: 18 MB
* index size: 109 MB
while without the patch the GIN index was 43 MB. I'm not really sure
*why* this happens, as it's not obvious from reading the patch what
exactly is this extra data that gets stored in the index, making it more
than double its size.
Are you sure that you did the comparison correctly? The sequence of index
building and data insertion does matter. I tried to build a gin index on 5
copies of /usr/share/dict/american-english with the patch and got a 43 MB index.
That leads me to the second issue. The pg_trgm code is already woefully
uncommented, and after spending quite some time reading it back and
forth I have to admit that I don't really understand what the code does
in the first place, and so I don't understand what the patch
changes. I read all the changes in detail and I couldn't find any obvious
mistakes like reading over array boundaries or dereferencing
uninitialized pointers, but I can't tell if the patch is correct
semantically. All test cases I threw at it work, though.
I'll try to write sufficient comments and send a new revision of the patch.
Would it be hard to make it support "n-grams" (e.g. making the length
configurable) instead of trigrams? I actually had the feeling that
penta-grams (pen-tuples or whatever they would be called) would
be better for my use case (large substring-search in large documents ..
eg. 500 within 3.000.
Larger sizes.. lesser "sensitivity" => Faster lookup .. perhaps my logic
Hm.. or will the knngist stuff help me here by selecting the best using
pentuples from the beginning?
The above comment is actually general to pg_trgm and not to the wildcard
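
To make the n-gram question concrete, here is a minimal Python sketch of
word-wise n-gram extraction with a configurable length. It is not the
pg_trgm code. For n = 3 it follows the padding pg_trgm documents (two
spaces before each word, one after); padding with n - 1 leading spaces
for other lengths is an assumption, as are the function names ngrams and
similarity. Word splitting is simplified to ASCII letters and digits.

```python
import re


def ngrams(text, n=3):
    """Return the set of n-grams of text, extracted word by word.

    Each word is lowercased and padded with n - 1 leading spaces and one
    trailing space. For n = 3 this matches the documented pg_trgm
    padding; for other n it is an assumed generalisation.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = " " * (n - 1) + word + " "
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def similarity(a, b, n=3):
    """Shared n-grams divided by all distinct n-grams of both strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    for n in (3, 5):
        print(n, sorted(ngrams("hello", n)), similarity("hello", "hallo", n))
```

With this sketch, "hello" and "hallo" share 3 of 9 distinct trigrams but
only 1 of 11 distinct 5-grams, so a single differing character costs
more as n grows. That is the lower "sensitivity" mentioned above; whether
it also makes index lookups faster would have to be measured.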
Sent via pgsql-hackers mailing list (email@example.com)