Currently tsearch2 does not accept non-ASCII stop words if the locale is
C. The included patches should fix the problem. The patches are against
PostgreSQL 8.2.3.
I'm not sure about the correctness of the patch's description.
First, the p_islatin() function is used only in the word/lexeme parser, not in
the stop-word code. Second, p_islatin() is used for catching lexemes such as
URLs or HTML entities, so it is important that it identifies real Latin
characters. And it works correctly: it calls p_isalpha (already patched for
your case), then it calls p_isascii, which should be correct for any encoding
with the C locale.
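
For illustration only, here is a minimal sketch of that check as described
above. The names mirror the tsearch2 parser functions, but the real p_is*
functions operate on the parser state rather than a single byte, so treat the
signatures here as assumptions, not the actual source:

#include <ctype.h>

/* Sketch only: not the actual tsearch2 code. */
static int
p_isascii_sketch(unsigned char c)
{
    return c <= 0x7F;               /* pure 7-bit ASCII */
}

static int
p_isalpha_sketch(unsigned char c)
{
    return isalpha(c);              /* locale-dependent alphabetic test */
}

/* A byte is "latin" only if it is alphabetic AND ASCII, so multibyte
 * input (e.g. UTF8 Cyrillic) is never classified as latin, which is
 * what the URL/HTML-entity lexeme detection relies on. */
static int
p_islatin_sketch(unsigned char c)
{
    return p_isalpha_sketch(c) && p_isascii_sketch(c);
}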
Third (and last):
contrib_regression=# show server_encoding;
 server_encoding
-----------------
 UTF8

contrib_regression=# show lc_ctype;
 lc_ctype
----------
 C

contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
 lexize
--------
 {}
Russian characters in UTF8 take two bytes each.
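
If it helps to see the two-byte point concretely, here is a small standalone C
example; the Cyrillic letter used is just an arbitrary illustration, not taken
from the patch:

#include <stdio.h>
#include <string.h>

/* Count UTF8 characters by skipping continuation bytes (10xxxxxx). */
static size_t
utf8_strlen(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    return n;
}

int
main(void)
{
    const char *word = "\xD0\xB8";  /* Cyrillic letter "и" in UTF8 */
    printf("bytes = %zu, chars = %zu\n", strlen(word), utf8_strlen(word));
    /* prints: bytes = 2, chars = 1 */
    return 0;
}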
--
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/