Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

Jan Urbański Fri, 09 Nov 2007 08:58:58 -0800

> This example still doesn't seem very convincing --- why would you not
> merely attach the stopword list to the pl_ispell dictionary?


Because the ispell-based dictionaries first stem the lexeme and only then
search for the result in the stopwords file. The problem here is that a
stopword is first stemmed into another lexeme (which is not in the
stopwords file, as it's a perfectly valid word) and then gets indexed
instead of being discarded.

To restate: the word 'od' in Polish is both a preposition and a declined
form of the noun 'oda'. The ispell dictionary, when passed the lexeme
'od', first stems it to 'oda' and then fails to find 'oda' in the
stopwords file. If I included the word 'oda' in the stopwords file, I'd
lose information about the noun 'oda' appearing in documents.
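
To make the ordering problem concrete, here is a rough toy model in Python. It is only a sketch of the behaviour described above, not the tsearch2 C code: the names STEMS, ispell_like, stopword_filter and lexize are made up for illustration, and the stemming table holds just the two forms mentioned in this message. A dictionary returns None to pass a token on, an empty list to discard it, and a list of lexemes to index it.

```python
# Toy model only: illustrative names, not the tsearch2 C API.
STEMS = {"od": "oda", "oda": "oda"}  # hypothetical stemming table
STOPWORDS = {"od"}                   # the preposition, as in a stopwords file


def ispell_like(token):
    """Stem first, then look the stem up in the stopword list."""
    stem = STEMS.get(token)
    if stem is None:
        return None   # unknown word: pass on to the next dictionary
    if stem in STOPWORDS:
        return []     # stem is a stopword: discard
    return [stem]     # otherwise index the stem


def stopword_filter(token):
    """Check the raw token against the stopword list, otherwise pass it on."""
    if token in STOPWORDS:
        return []     # discard before any stemming happens
    return None       # not a stopword: let the next dictionary decide


def lexize(token, dictionaries):
    """Try each dictionary in order; the first non-None answer wins."""
    for dictionary in dictionaries:
        result = dictionary(token)
        if result is not None:
            return result
    return []


print(lexize("od", [ispell_like]))                    # ['oda']
print(lexize("od", [stopword_filter, ispell_like]))   # []
print(lexize("oda", [stopword_filter, ispell_like]))  # ['oda']
```

With the filter placed first, the preposition 'od' is dropped while the noun 'oda' is still indexed, which is the behaviour the proposed dictionary is after.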

I'm still trying to find an English example, as I'm sure it would be
easier for most readers of this list to understand. Nothing comes to
mind, however - I guess some languages just have rotten luck with their
grammar.

> If there is a use-case for it, IMHO it'd be better to add a boolean
> accept-or-pass-on parameter to the "simple" dictionary than to add a
> whole new dictionary type.

Ah, I never thought of that. You may well be right - it does look like an
easier solution. However, it would require coding some basic parsing
logic into the dex_init procedure, because right now the 'simple'
dictionary expects dict_initoption to be a path to the stopwords file.
Do you mean something like 'StopFile="/path/to/stopwords",
AcceptUnknown=0'?
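
For illustration, the "basic parsing logic" could look roughly like the Python sketch below. This is not the dex_init code: it assumes the hypothetical comma-separated key=value syntax from the example above, and the function name parse_dict_options and the lowercasing of keys are my own choices.

```python
import re

# Hypothetical option syntax, taken from the example in this message:
#   StopFile="/path/to/stopwords", AcceptUnknown=0
# One key=value pair; the value is either double-quoted or a bare token.
OPTION_RE = re.compile(r'\s*(\w+)\s*=\s*(?:"([^"]*)"|([^,\s]+))\s*(?:,|$)')


def parse_dict_options(text):
    """Split an option string into a dict of lowercased keys to string values."""
    text = text.strip()
    options = {}
    pos = 0
    while pos < len(text):
        match = OPTION_RE.match(text, pos)
        if match is None:
            raise ValueError("cannot parse options at: %r" % text[pos:])
        key, quoted, bare = match.groups()
        options[key.lower()] = quoted if quoted is not None else bare
        pos = match.end()
    return options


opts = parse_dict_options('StopFile="/path/to/stopwords", AcceptUnknown=0')
print(opts["stopfile"])             # /path/to/stopwords
print(opts["acceptunknown"] == "1") # False
```

Note that a bare path with no key, which is what the 'simple' dictionary accepts today, would be rejected by this sketch, so a real implementation would need to decide how to stay compatible with existing configurations.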

Regards,
Jan Urbanski
-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
