Sometimes web pages do not declare the encoding they are in. In these cases, the client has to "guess" the encoding. Nutch currently does not have a guessing algorithm, so if it encounters one of these pages, it just decodes the page using the encoding set in the parser.character.encoding.default parameter. If that default does not match the page's real encoding, the decoded text comes out garbled.
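To make the fallback behaviour concrete, here is a rough sketch of the idea in Java. It is not Nutch's actual parser code: the class EncodingFallback, its methods, and the DEFAULT_ENCODING constant are hypothetical names, and the constant only stands in for whatever parser.character.encoding.default is set to. The sketch looks for a declared charset in the Content-Type header or in the first bytes of the page, and uses the default when nothing is declared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingFallback {
    // Hypothetical stand-in for the configured parser.character.encoding.default value.
    static final String DEFAULT_ENCODING = "UTF-8";

    // Matches charset=... in an HTTP Content-Type header or an HTML meta tag.
    private static final Pattern CHARSET_PATTERN =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if the page does not declare one.
    static String declaredCharset(String contentTypeHeader, byte[] body) {
        if (contentTypeHeader != null) {
            Matcher m = CHARSET_PATTERN.matcher(contentTypeHeader);
            if (m.find()) {
                return m.group(1);
            }
        }
        // Scan the first 1 KB as ISO-8859-1, which maps every byte to one char,
        // so ASCII markup such as a meta tag stays readable in any ASCII-compatible encoding.
        int len = Math.min(body.length, 1024);
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET_PATTERN.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // Decodes with the declared charset if there is one, otherwise with the default.
    // No guessing happens here: an undeclared page is always read with the default.
    static String decode(String contentTypeHeader, byte[] body) {
        String name = declaredCharset(contentTypeHeader, body);
        Charset cs;
        try {
            cs = Charset.forName(name != null ? name : DEFAULT_ENCODING);
        } catch (IllegalArgumentException e) {
            // Declared name is malformed or unsupported: fall back to the default.
            cs = Charset.forName(DEFAULT_ENCODING);
        }
        return new String(body, cs);
    }
}
```

With this sketch, an ISO-8859-1 page that declares nothing gets decoded as UTF-8, and any non-ASCII bytes turn into replacement characters. That is the kind of damage a real detection algorithm would be meant to avoid.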
Probably the best thing to do is to port over Mozilla's algorithm. I know there's a port called jcharset, but I've tested it a few times and it does not seem very accurate, for reasons unknown. I haven't had the chance to dig too deeply into the issue.

On 5/18/05, k-team <[EMAIL PROTECTED]> wrote:
> hi guys,
>
> we have indexed some pages and noticed that the results of
> the search are not interpreted correctly by our browser. the encoding
> in search.jsp is utf-8 and the browser is set to utf-8 encoding, but
> we obtain strange chars.
>
> we have also set parser.character.encoding.default in
> nutch-default.xml to utf-8.
>
> anyone knows what we are missing?
>
> ciao,
> KTeam
