Dan Morrill wrote:
Since you are using Luke to see the index, Luke may not have the character
support built in for non-UTF-8 character sets (meaning gork when you look at
it). I went to the Luke site http://www.getopt.org/luke/ to see if they make
mention of the character sets they support, but there is nothing that states
they support any character set.
When you run your search, do you see good characters, or do you see gork?
Luke may not be able to understand the ISO character sets. (Hypothesis).
Hi,
(I'm the guy behind Luke)
Luke uses UTF-8, because that's what Lucene stores in the index. You may
experience problems with the default font it uses, which doesn't support
all Unicode characters. Please try changing the font (in Settings) and
see if it helps.
Another frequent source of garbled characters is reading the original
content with the wrong encoding, e.g. reading a UTF-8 file using your
native platform encoding such as Latin1 or Big5, or the other way
around. The broken characters are then encoded to UTF-8 when Lucene
writes out the index, and restored from UTF-8 to their broken form when
Luke reads the index.
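
A minimal sketch of that effect, written in Python for brevity rather
than Java. The sample string and variable names are illustrative
assumptions, not anything taken from Nutch, Lucene or Luke; the point is
only that the damage happens at read time and that the index then
preserves it faithfully:

```python
# Illustrative sample text (an assumption, not from the original thread).
original = "café"

# What sits on disk: the text correctly encoded as UTF-8.
raw_bytes = original.encode("utf-8")

# Wrong: read the UTF-8 bytes as Latin-1. Every byte maps to some
# Latin-1 character, so no error is raised; the text is silently broken.
garbled = raw_bytes.decode("latin-1")
assert garbled == "cafÃ©"

# The index stores the already-broken string as UTF-8, and a viewer that
# decodes UTF-8 correctly gets the same broken string back.
stored = garbled.encode("utf-8")
shown = stored.decode("utf-8")
assert shown == garbled

# Right: decode with the encoding the file was actually written in.
correct = raw_bytes.decode("utf-8")
assert correct == original
```

If the viewer shows something like the garbled form above, the place to
look is the code that reads the source documents, not the index viewer.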
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com