[Nutch-general] Re: hi all

Andrzej Bialecki Sun, 02 Apr 2006 17:02:16 -0700

Dan Morrill wrote:

Since you are using Luke to view the index, Luke may not have built-in
support for non-UTF-8 character sets (meaning gork when you look at
it). I went to the Luke site, http://www.getopt.org/luke/, to see if they
mention the character sets they support, but nothing there states that
they support any particular character set.

When you run your search, do you see good characters, or do you see gork?

Luke may not be able to understand the ISO character sets. (Hypothesis).


Hi,

(I'm the guy behind Luke)

Luke uses UTF-8, because that's what Lucene stores in the index. You may
experience problems with the default font that it uses, i.e. that it
doesn't support all Unicode characters. Please try to change the font
(in Settings) and see if it helps.

Another frequent source of garbled characters is when you read the
original content using the wrong encoding, e.g. if you read a UTF-8 file
using your native platform encoding like Latin1 or Big5, or the other
way around. Then you get broken characters being encoded to UTF-8 when
Lucene writes out the index, and restored from UTF-8 to their broken
form when Luke reads the index.
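
To make the wrong-encoding case concrete, here is a minimal sketch in Python (not Nutch or Lucene code; the function name index_roundtrip and the sample word are made up for illustration). It only mimics the sequence of decode and encode steps: the reader decodes the file with some encoding, the index stores the result as UTF-8, and the viewer decodes UTF-8 again.

```python
def index_roundtrip(raw_bytes: bytes, read_encoding: str) -> str:
    """Hypothetical stand-in for: read a file, index it, view the index.

    raw_bytes      -- the bytes of the original document on disk
    read_encoding  -- the encoding the reader *assumes* the file uses
    """
    text_as_read = raw_bytes.decode(read_encoding)  # reader decodes the file
    stored = text_as_read.encode("utf-8")           # index stores UTF-8
    return stored.decode("utf-8")                   # viewer decodes UTF-8


# A UTF-8 file containing one non-ASCII character.
utf8_file = "café".encode("utf-8")

# Correct assumption: the text survives the round trip.
print(index_roundtrip(utf8_file, "utf-8"))    # café

# Wrong assumption (Latin1): the two UTF-8 bytes of "é" are read as two
# separate characters, and that broken form is what gets stored and shown.
print(index_roundtrip(utf8_file, "latin-1"))  # cafÃ©
```

In the second call the viewer is doing its job correctly; it faithfully restores the text that was already broken at read time. If something like "Ã©" shows up where an accented letter should be, the encoding used when the original content was read is the first thing to check.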


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




