> Should we check for stop words before stemming or after ?

The current implementation supports both variants. Look at the dictionary
interface definition in morph.c:

typedef struct
{
         char        localename[NAMEDATALEN];
         /* init dictionary */
         void       *(*init) (void);
         /* close dictionary */
         void        (*close) (void *);
         /* find in dictionary */
         char       *(*lemmatize) (void *, char *, int *);
         /* stop-word check, called before lemmatize */
         int         (*is_stoplemm) (void *, char *, int);
         /* stop-word check, called after lemmatize */
         int         (*is_stemstoplemm) (void *, char *, int);
}        DICT;

The 'is_stoplemm' method is called before 'lemmatize', and 'is_stemstoplemm'
is called after it. At the end of dict/porter_english.dct:
TABLE_DICT_START
         "C",
         setup_english_stemmer,
         closedown_english_stemmer,
         engstemming,
         NULL,
         is_stopengword
TABLE_DICT_END

dict/russian_stemming.dct:
TABLE_DICT_START
         "ru_RU.KOI8-R",
         NULL,
         NULL,
         ru_RUKOI8R_stem,
         ru_RUKOI8R_is_stopword,
         NULL
TABLE_DICT_END

So the English stemmer decides whether a lexeme is a stop word after stemming,
while the Russian one decides before stemming.
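
To illustrate the call order, here is a minimal sketch. It is not the actual
morph.c code. The driver lemmatize_or_drop, the toy callbacks toy_stem and
toy_is_stop, the two tables, and the NAMEDATALEN value are all hypothetical,
and I assume the int * argument of lemmatize carries the word length in and
out. A NULL hook simply means that check is skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAMEDATALEN 64          /* assumed value, for this sketch only */

typedef struct
{
    char        localename[NAMEDATALEN];
    void       *(*init) (void);
    void        (*close) (void *);
    char       *(*lemmatize) (void *, char *, int *);
    int         (*is_stoplemm) (void *, char *, int);
    int         (*is_stemstoplemm) (void *, char *, int);
} DICT;

/* Hypothetical driver: returns the lemma, or NULL for a stop word. */
static char *
lemmatize_or_drop(DICT *dict, void *state, char *word, int *len)
{
    char       *lemma;

    assert(dict->lemmatize != NULL);

    /* Check before stemming, if the dictionary provides that hook. */
    if (dict->is_stoplemm && dict->is_stoplemm(state, word, *len))
        return NULL;

    lemma = dict->lemmatize(state, word, len);

    /* Check after stemming, if the dictionary provides that hook. */
    if (lemma && dict->is_stemstoplemm &&
        dict->is_stemstoplemm(state, lemma, *len))
        return NULL;

    return lemma;
}

/* Toy stemmer (hypothetical): strips one trailing 's' in place. */
static char *
toy_stem(void *state, char *word, int *len)
{
    (void) state;
    if (*len > 1 && word[*len - 1] == 's')
    {
        (*len)--;
        word[*len] = '\0';
    }
    return word;
}

/* Toy stop list (hypothetical): only "cat" is a stop word. */
static int
toy_is_stop(void *state, char *word, int len)
{
    (void) state;
    return len == 3 && strncmp(word, "cat", 3) == 0;
}

/* Checks after stemming, like the English table above. */
static DICT after_dict = {"C", NULL, NULL, toy_stem, NULL, toy_is_stop};

/* Checks before stemming, like the Russian table above. */
static DICT before_dict = {"C", NULL, NULL, toy_stem, toy_is_stop, NULL};
```

With these toy tables, "cats" is dropped by after_dict (it stems to "cat",
which is in the stop list) but kept by before_dict (the unstemmed "cats" is
not in the list), so the two variants can give different results for the
same stop list.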



-- 
Teodor Sigaev
[EMAIL PROTECTED]


