> > Currently tsearch2 does not accept non-ASCII stop words if the locale
> > is C. The included patches should fix the problem. The patches are
> > against PostgreSQL 8.2.3.
>
> I'm not sure about the correctness of the patch's description.
>
> First, the p_islatin() function is used only in the word/lexeme parser,
> not in the stop-word code.
I know. My guess is that the parser does not read the stop-word file, at
least with the default configuration.

> Second, the p_islatin() function is used for catching lexemes like URLs
> or HTML entities, so it's important to identify real Latin characters.
> And it works correctly: it calls p_isalpha (already patched for your
> case), then it calls p_isascii, which should be correct for any encoding
> with the C locale.

The original p_islatin is defined as follows:

static int
p_islatin(TParser * prs)
{
	return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
}

So if a character is not ASCII, it returns 0 even if p_isalpha returns 1.
Is this what you expect?

> Third (and last):
>
> contrib_regression=# show server_encoding;
>  server_encoding
> -----------------
>  UTF8
> contrib_regression=# show lc_ctype;
>  lc_ctype
> ----------
>  C
> contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
>  lexize
> --------
>  {}
>
> Russian characters in UTF8 take two bytes.

In our case, we added JAPANESE_STOP_WORD to english.stop, then ran:

select to_tsvector(JAPANESE_STOP_WORD)

which returned the words even though they are in JAPANESE_STOP_WORD. With
the patches applied, the problem was solved.
--
Tatsuo Ishii
SRA OSS, Inc. Japan