Hola Juan,
On 08/21/2008 at 1:16 PM, Juan Pablo Morales wrote:
> I have an index in Spanish and I use Snowball to stem and
> analyze and it works perfectly. However, I am running into
> trouble storing (not indexing, only storing) words that
> have special characters.
>
> That is, I store the special character but it comes
> back garbled when I read it.
> To provide an example:
>
> String content = "niños";
> document.add(new Field("name",content,Store.YES, Index.Tokenized));
> writer.addDocument(document, new SnowballAnalyzer("Spanish"));
If your source code is encoded as Latin-1, then it will likely appear to you to
be the correct character (depending on the editor/viewer you're using and its
configuration), but Java may not properly convert it to Unicode, depending on
the encoding it expects your source code to be in (see the -encoding option to
javac - if you don't specify it, then the platform default encoding is used).
You could test whether this is the problem by instead trying:
String content = "ni\u00F1os";
...
> Looking at the index with Luke it shows me "ni�os" but
> when I want to see the full text (by right clicking) it shows
> me ni os.
� is the Unicode replacement character (U+FFFD), and it's routinely
used, including within Lucene itself, as the substitute character for byte
sequences that are not valid in the designated source encoding.
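You can see that substitution happen with plain Java, no Lucene involved. The sketch below (my own illustration, not code from this thread) encodes "niños" as Latin-1 and then decodes those bytes as UTF-8; the 0xF1 byte is not a valid UTF-8 sequence at that position, so the decoder replaces it with U+FFFD:

```java
import java.nio.charset.Charset;

// Illustration only: simulate Latin-1 encoded "niños" being decoded as UTF-8.
public class MojibakeDemo {
    public static void main(String[] args) {
        String correct = "ni\u00F1os";  // "niños", written with a Unicode escape
        byte[] latin1 = correct.getBytes(Charset.forName("ISO-8859-1"));
        // 0xF1 (Latin-1 n-tilde) begins a multi-byte UTF-8 sequence, but the
        // following byte ('o') is not a valid continuation, so the decoder
        // substitutes U+FFFD -- producing the "ni?os" you see in Luke.
        String misread = new String(latin1, Charset.forName("UTF-8"));
        System.out.println(misread);  // prints "ni\uFFFDos"
    }
}
```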
> I know Lucene is supposed to store fields in UTF8, but then,
> how can I make sure I store something and get it back just as
> it was, including special characters?
Make sure the data you give to Lucene is properly encoded, and what you get
back should be as well.
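As a sanity check of that principle (again my own stdlib-only sketch, not anything Lucene-specific): if you encode and decode with the same explicitly named charset, the round trip is lossless, special characters included.

```java
import java.nio.charset.Charset;

// Sketch: encode a string with an explicit charset and decode it with the
// same charset -- the text survives intact, special characters included.
public class RoundTrip {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        String original = "ni\u00F1os";               // "niños"
        byte[] bytes = original.getBytes(utf8);       // encode explicitly
        String decoded = new String(bytes, utf8);     // decode with the same charset
        System.out.println(original.equals(decoded)); // prints "true"
    }
}
```

The problems start only when the decoding charset differs from the encoding one, which is why it pays to specify the charset explicitly everywhere text crosses a byte boundary (file I/O, javac, and so on) rather than relying on the platform default.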
Please try the suggestion I gave you above ("ni\u00F1os"). If you still have
the same problem, you may have found a bug - please report back what you find.
Steve