Hola Juan,
On 08/21/2008 at 1:16 PM, Juan Pablo Morales wrote:
> I have an index in Spanish and I use Snowball to stem and
> analyze and it works perfectly. However, I am running into
> trouble storing (not indexing, only storing) words that
> have special characters.
>
> That is, I store the special character but it comes
> back garbled when I read it.
> To provide an example:
>
> String content = "niños";
> document.add(new Field("name",content,Store.YES, Index.Tokenized));
> writer.addDocument(document, new SnowballAnalyzer("Spanish"));
If your source code is encoded as Latin-1, then it will likely appear to you to
be the correct character (depending on the editor/viewer you're using and its
configuration), but Java may not properly convert it to Unicode, depending on
the encoding it expects your source code to be in (see the -encoding option to
javac - if you don't specify it, then the platform default encoding is used).
You could test whether this is the problem by instead trying:
String content = "ni\u00F1os";
...
> Looking at the index with Luke it shows me "ni�os" but
> when I want to see the full text (by right clicking) it shows
> me ni os.
� is the Unicode replacement character (U+FFFD), and it's routinely
used, including within Lucene itself, as the substitute character for byte
sequences that are not valid in the designated source encoding.
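You can see that substitution happen with plain Java, no Lucene involved. The sketch below (my own illustration, not code from this thread) encodes "niños" as Latin-1 and then decodes those bytes as UTF-8; the 0xF1 byte is not a valid UTF-8 sequence at that position, so the decoder replaces it with U+FFFD:

```java
import java.nio.charset.Charset;

// Illustration only: simulate Latin-1 encoded "niños" being decoded as UTF-8.
public class MojibakeDemo {
    public static void main(String[] args) {
        String correct = "ni\u00F1os";  // "niños", written with a Unicode escape
        byte[] latin1 = correct.getBytes(Charset.forName("ISO-8859-1"));
        // 0xF1 (Latin-1 n-tilde) begins a multi-byte UTF-8 sequence, but the
        // following byte ('o') is not a valid continuation, so the decoder
        // substitutes U+FFFD -- producing the "ni?os" you see in Luke.
        String misread = new String(latin1, Charset.forName("UTF-8"));
        System.out.println(misread);  // prints "ni\uFFFDos"
    }
}
```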
> I know Lucene is supposed to store fields in UTF8, but then,
> how can I make sure I store something and get it back just as
> it was, including special characters?
Make sure the data you give to Lucene is properly encoded, and what you get
back should be as well.
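As a sanity check of that principle (again my own stdlib-only sketch, not anything Lucene-specific): if you encode and decode with the same explicitly named charset, the round trip is lossless, special characters included.

```java
import java.nio.charset.Charset;

// Sketch: encode a string with an explicit charset and decode it with the
// same charset -- the text survives intact, special characters included.
public class RoundTrip {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        String original = "ni\u00F1os";               // "niños"
        byte[] bytes = original.getBytes(utf8);       // encode explicitly
        String decoded = new String(bytes, utf8);     // decode with the same charset
        System.out.println(original.equals(decoded)); // prints "true"
    }
}
```

The problems start only when the decoding charset differs from the encoding one, which is why it pays to specify the charset explicitly everywhere text crosses a byte boundary (file I/O, javac, and so on) rather than relying on the platform default.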
Please try the suggestion I gave you above ("ni\u00F1os"). If you still have
the same problem, you may have found a bug - please report back what you find.
Steve