Re: encoding

John Haxby Thu, 26 Jan 2006 09:18:02 -0800

arnaudbuffet wrote:

If I try to index a text file encoded in Western 1252, for example with the Turkish text
"düzenlediğimiz kampanyamıza", the Lucene index will contain re-encoded data with
&#0;&#17;k&#0;&#0; ....

ISOLatin1AccentFilter.removeAccents() converts that string to
"duzenlediğimiz kampanyamıza". The g-breve and the dotless-i are
untouched. My AsciiDecomposeFilter.decompose() converts the string to
"duzenledigimiz kampanyamiza".
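
To illustrate the kind of folding described above, here is a minimal sketch built on the standard java.text.Normalizer class. It is not the AsciiDecomposeFilter code, which is not shown in this thread; the class name AsciiDecomposeSketch and the method body are assumptions made for illustration. The g-breve and the dotless-i sit outside the Latin-1 range, which is presumably why a Latin-1 accent filter leaves them alone. The dotless-i also has no canonical decomposition, so the sketch maps it by hand.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not the filter from this thread.
class AsciiDecomposeSketch {

    static String decompose(String input) {
        // NFD splits letters such as u-umlaut and g-breve into
        // a base letter followed by a combining mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop every combining mark (Unicode general category M).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Dotless i (U+0131) does not decompose, so map it explicitly.
        return stripped.replace('\u0131', 'i');
    }

    public static void main(String[] args) {
        // The Turkish sample text, written with escapes so the source
        // file's own encoding cannot interfere.
        String text = "d\u00fczenledi\u011fimiz kampanyam\u0131za";
        System.out.println(decompose(text));
    }
}
```

Any other letter without a decomposition (o-slash, for instance) would need its own explicit mapping in the same way.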


However, since you're seeing those rather odd entities, it looks as
though you're not actually indexing what you think you're indexing. As
Erik says, you need to make sure that you're reading files with the
proper encoding; removing accents and adding dots won't help.
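
As a rough sketch of what "reading with the proper encoding" means in Java: pass the charset to the reader explicitly instead of relying on the platform default. The class and method names below are made up for illustration, and UTF-8 is used only because every JVM is guaranteed to support it; the real encoding of your files is something you have to establish yourself.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are not from Lucene or this thread.
class ExplicitEncodingSketch {

    // Decode raw bytes with the charset the caller names.
    static String readAll(byte[] raw, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "d\u00fczenledi\u011fimiz kampanyam\u0131za";
        // Stand-in for a file on disk: the sample text stored as UTF-8 bytes.
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips.
        System.out.println(readAll(raw, StandardCharsets.UTF_8).equals(text));
        // Decoding the same bytes as ISO-8859-1 yields different text.
        System.out.println(readAll(raw, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

For a real file you would wrap a FileInputStream in the same InputStreamReader and hand the resulting Reader to whatever builds your documents. If the text is wrong at that point, no filter further down the analysis chain can repair it.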

jch


