Dan Morrill wrote:
Since you are using Luke to view the index, Luke may not have built-in
support for non-UTF-8 character sets (meaning you see garbage when you look
at it). I went to the Luke site, http://www.getopt.org/luke/, to see whether
it mentions the character sets Luke supports, but nothing there states which
character sets are supported.
When you run your search, do you see good characters, or do you see garbage?
Luke may not be able to understand the ISO character sets. (Hypothesis.)

Hi,

(I'm the guy behind Luke)

Luke uses UTF-8 because that's what Lucene stores in the index. You may be running into a limitation of the default font Luke uses, i.e. it doesn't cover all Unicode characters. Please try changing the font (in Settings) and see if that helps.

Another frequent source of garbled characters is reading the original content with the wrong encoding, e.g. reading a UTF-8 file using your native platform encoding such as Latin1 or Big5, or the other way around. The already-broken characters are then encoded to UTF-8 when Lucene writes out the index, and restored from UTF-8 to their broken form when Luke reads the index.
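
To make the wrong-encoding case concrete, here is a minimal sketch in Python. It is not Lucene or Luke code: the function names and the sample text are made up for illustration, and the "index" is just a UTF-8 byte string standing in for what gets stored. It only shows that once text has been decoded with the wrong charset, a correct UTF-8 round trip preserves the damage.

```python
# Illustration only: simulates reading UTF-8 content with the wrong charset
# and then storing/reading it as UTF-8. All names here are hypothetical.

def read_content(raw_bytes: bytes, encoding: str) -> str:
    """Decode the original file bytes using the given charset."""
    return raw_bytes.decode(encoding)

def index_roundtrip(text: str) -> str:
    """Stand-in for 'write as UTF-8, read back as UTF-8'."""
    stored = text.encode("utf-8")   # what the index writer would store
    return stored.decode("utf-8")   # what an index viewer would show

# Sample text (an assumption), as it sits on disk in a UTF-8 file.
original = "Zażółć gęślą jaźń"
file_bytes = original.encode("utf-8")

# Wrong: Latin-1 accepts every byte value, so no error is raised,
# but each multi-byte character becomes two unrelated characters.
misread = read_content(file_bytes, "latin-1")
shown_broken = index_roundtrip(misread)

# Right: decode with the charset the file was actually written in.
shown_good = index_roundtrip(read_content(file_bytes, "utf-8"))

# Diagnostic: if reversing the suspected misread recovers sensible text,
# the content was probably read with the wrong charset at indexing time.
repaired = shown_broken.encode("latin-1").decode("utf-8")
```

In this sketch `shown_broken` is identical to `misread`, so the viewer is faithfully displaying what was indexed. The fix belongs at the point where the content is first read, not in the viewer.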

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
