parsed incorrectly
-------------------------

Key: NUTCH-519
URL: https://issues.apache.org/jira/browse/NUTCH-519
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Environment: Linux 2.6.21, Java 1.5, Nutch 0.9
Reporter: Chris Hane

I have deployed Nutch in a standard configuration without any modifications. On every page it crawls on my website, the parse phase converts an HTML entity into Â.

The charset is set on the page as:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

When I issue the command

bin/nutch readseg -get demo.crawl/segments/20070718174552/ http://demo.itsolut.com/mr.com/bookstore/maintenancemanagement/wiremanlibrary.htm

the HTML portion contains:

<tr>
<td align="center">
<b><font face="Arial" size="0">Address: 120 South 7th Street - Terre Haute, IN 47807</font></b>
</td>
</tr>

and the parsed content is:

Address: 120 South 7th Street - Terre Haute, IN 47807

Also, the output contains the following:

Parse Metadata: OriginalCharEncoding=windows-1252 CharEncodingForConversion=windows-1252

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
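
For readers trying to reproduce the symptom: a stray Â is commonly seen when a non-breaking space (U+00A0) is written out as UTF-8 and the resulting bytes are then read back as windows-1252. The report does not say which entity is involved or where in Nutch the conversion happens, so the sketch below only illustrates that general mechanism in Python. The helper name `mis_decode` and the sample string are hypothetical, and this is not a confirmed diagnosis of NUTCH-519.

```python
# Hypothetical illustration, not Nutch code: shows how a non-breaking
# space can surface as "Â" when UTF-8 bytes are decoded as windows-1252.

NBSP = "\u00a0"  # the character a non-breaking-space entity decodes to


def mis_decode(text: str, wrong_charset: str = "windows-1252") -> str:
    """Encode text as UTF-8, then decode those bytes with another charset."""
    return text.encode("utf-8").decode(wrong_charset)


# U+00A0 is the two bytes C2 A0 in UTF-8. Read as windows-1252,
# C2 becomes "Â" and A0 becomes a non-breaking space again.
garbled = mis_decode("Address:" + NBSP + "120")
print(repr(garbled))
```

If the parsed text from `readseg` shows Â immediately followed by a non-breaking space wherever the page source had such an entity, that would be consistent with this kind of double conversion. Any other pattern points elsewhere.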