One possibility is that the user-visible specification is just a name
(eg, "english"), but the actual filename out on the filesystem is,
say, name.encoding.stop (eg, "english.utf8.stop") where we use PG's
names for the encodings.  We could just fail if there's not a file
matching the database encoding, or we could try that and then try
utf8, or some other rule.  In any case I'd want it to verify and
convert encoding as necessary while reading.
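
To make the lookup rule above concrete, here is a rough sketch in Python (not the actual backend code). It assumes the "try the database encoding, then fall back to utf8" variant. The helper names, the directory argument, and the small table mapping PG-style encoding names to Python codec names are all hypothetical and exist only for illustration.

```python
import os

# Assumed mapping from PG-style encoding names (as used in the file name)
# to Python codec names; illustrative only, not a complete list.
PG_TO_CODEC = {"utf8": "utf-8", "latin1": "latin-1", "koi8r": "koi8_r"}


def find_stop_file(directory, name, db_encoding, fallback="utf8"):
    """Return (path, encoding) of the first existing name.encoding.stop file."""
    # Try the database encoding first, then the fallback, without duplicates.
    for enc in dict.fromkeys([db_encoding, fallback]):
        path = os.path.join(directory, f"{name}.{enc}.stop")
        if os.path.isfile(path):
            return path, enc
    raise FileNotFoundError(f"no stop file for {name!r} in {directory!r}")


def read_stop_words(directory, name, db_encoding):
    """Read the stop list, verifying and converting the encoding on the way."""
    path, file_enc = find_stop_file(directory, name, db_encoding)
    with open(path, "rb") as f:
        raw = f.read()
    # Verify: raises UnicodeDecodeError if the bytes are not valid file_enc.
    text = raw.decode(PG_TO_CODEC[file_enc])
    # Convert: raises UnicodeEncodeError if a word cannot be represented
    # in the database encoding.
    text.encode(PG_TO_CODEC[db_encoding])
    return [line.strip() for line in text.splitlines() if line.strip()]
```

With this rule, a koi8r database asking for "russian" would read russian.koi8r.stop if it exists and otherwise recode russian.utf8.stop. The stricter alternative (just fail when no file matches the database encoding) would amount to dropping the fallback from the candidate list.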


I have no strong objection to UTF8-encoded files (stop words, ispell, synonym, or thesaurus). Just recode them after reading.

But configurations for different languages may differ. For example, a Russian (or any Cyrillic-based) configuration differs from a West European one, because they are based on different character sets. So we would need non-obvious rules for stemmers to define exactly which stemmer and stop file should be used. For Russian with utf8 encoding, lword should use the English stemmer, but for Italian it should use the Italian stemmer. A Russian word can't contain any ASCII characters, but an Italian word may consist entirely of ASCII.
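
A minimal sketch of the kind of rule I mean, in Python rather than backend code. The function name and the set of Cyrillic-based configurations are hypothetical, and the ASCII test is only a stand-in for the parser classifying a token as lword.

```python
# Assumed set of Cyrillic-based configurations; illustrative only.
CYRILLIC_CONFIGS = {"russian"}


def pick_stemmer(config_language, word):
    """Choose a stemmer name for a word under the given configuration."""
    # Stand-in for the parser reporting an lword token.
    is_ascii_word = all(ord(ch) < 128 for ch in word)
    if config_language in CYRILLIC_CONFIGS and is_ascii_word:
        # An all-ASCII word cannot be a Russian word, so treat it as English.
        return "english"
    # West European languages keep their own stemmer even for all-ASCII
    # words, since e.g. an Italian word may consist entirely of ASCII.
    return config_language
```

So pick_stemmer("russian", "table") gives "english", while pick_stemmer("italian", "casa") gives "italian". The same choice would have to be made for the stop file.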



--
Teodor Sigaev                                   E-mail: [EMAIL PROTECTED]
                                                   WWW: http://www.sigaev.ru/
