Re: Charset encoding

Andy Liu Wed, 18 May 2005 06:19:11 -0700

Sometimes web pages do not identify the encoding the page is in.  In
these cases, the client has to "guess" the encoding.  Nutch currently
does not have a guessing algorithm, so if it encounters one of these
pages, it just decodes the page using the
parser.character.encoding.default parameter.
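
To make the fallback concrete, here is a minimal sketch in Java. It is not Nutch's actual code: the class and method names are hypothetical, "declared" stands for whatever charset the HTTP header or meta tag supplied (possibly nothing), and "defaultName" stands in for the value of parser.character.encoding.default.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of Nutch.
class EncodingFallback {

    // Pick the declared charset if it is present and supported,
    // otherwise fall back to the configured default.
    static Charset resolve(String declared, String defaultName) {
        if (declared != null && !declared.trim().isEmpty()) {
            try {
                return Charset.forName(declared.trim());
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Unusable declaration: fall through to the default.
            }
        }
        return Charset.forName(defaultName);
    }

    // Decode raw page bytes with the resolved charset. Bytes that are
    // malformed in that charset come out as U+FFFD replacement characters.
    static String decode(byte[] content, String declared, String defaultName) {
        return new String(content, resolve(declared, defaultName));
    }

    public static void main(String[] args) {
        // A page stored as ISO-8859-1 that declares no encoding.
        byte[] page = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(page, null, "ISO-8859-1"));
        System.out.println(decode(page, null, "UTF-8"));
    }
}
```

With the default set to UTF-8, the second call cannot decode the final byte and substitutes a replacement character, which is the kind of "strange chars" an undeclared non-UTF-8 page produces.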

Probably the best thing to do is to port over Mozilla's algorithm.  I
know there's a port called jcharset, but I've tested it a few times
and it does not seem very accurate, for reasons unknown.  I haven't had
the chance to dig too deeply into the issue.
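
For illustration only, a much cruder guess than Mozilla's algorithm (and not a port of it, nor of jcharset) is to test whether the bytes are valid UTF-8 and otherwise fall back to a configured charset. The class and method names in this Java sketch are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a "valid UTF-8 or fallback" heuristic.
class Utf8OrFallbackGuesser {

    // Returns UTF-8 if the content decodes as UTF-8 without errors,
    // otherwise returns the supplied fallback charset.
    static Charset guess(byte[] content, Charset fallback) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(content));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }
}
```

This only separates UTF-8 from everything else. Pure ASCII content also passes the check, and the heuristic cannot tell one legacy single-byte encoding from another, which is where a statistical detector would be needed.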

On 5/18/05, k-team <[EMAIL PROTECTED]> wrote:
> hi guys,
> 
> we have indexed some pages and noticed that the results of
> the search are not interpreted correctly by our browser. the encoding
> in search.jsp is utf-8 and the browser is set to utf-8 encoding, but
> we obtain strange chars.
> 
> we have also set parser.character.encoding.default in
> nutch-default.xml to utf-8.
> 
> does anyone know what we are missing?
> 
> ciao,
> KTeam
>
