Are you sure that the HTMLParser is decoding the page incorrectly?

I've seen Nutch deployments where the characters are correctly decoded
by the HTMLParser and are correct in the Lucene index, but then the
webapp is misconfigured such that they are not displayed correctly on
the search results page.

You can use Luke, the Lucene index browser, to open the index and
examine the stored contents.  If the text in the index is correct,
then the problem is not the HTMLParser, and the webapp's output
encoding is the next thing to check.
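
If it helps to tell "good" from "bad" text once you have copied a field
value out of the index, here is a rough sketch in Python.  It is not
part of Nutch or Lucene; the function name is made up, and it assumes
the page was really UTF-8 and that any wrong decoding used ISO-8859-1.
Other charset mix-ups would need a different check.

```python
# Hypothetical helper, not part of Nutch or Lucene.
# Assumption: the original page was UTF-8 and a faulty decode, if any,
# used ISO-8859-1 (Latin-1).

def looks_like_mojibake(text: str) -> bool:
    """Return True if text looks like UTF-8 bytes decoded as ISO-8859-1."""
    try:
        # Undo the suspected wrong decode, then decode as UTF-8.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so this pattern does not apply.
        return False
    # A different result means the text contained mis-decoded sequences.
    return repaired != text


good = "caf\u00e9"                            # correctly decoded text
bad = good.encode("utf-8").decode("latin-1")  # same bytes, wrong charset

print(looks_like_mojibake(good))
print(looks_like_mojibake(bad))
```

If the value from the index passes this kind of check but the search
results page still shows garbage, that points at the webapp rather than
the parser.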


Regards,

Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aa...@archive.org
