[ https://issues.apache.org/jira/browse/NUTCH-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-519.
-------------------------------
Resolution: Won't Fix
Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira
> parsed incorrectly
> -------------------------
>
> Key: NUTCH-519
> URL: https://issues.apache.org/jira/browse/NUTCH-519
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.9.0
> Environment: Linux 2.6.21
> Java 1.5
> Nutch 0.9
> Reporter: Chris Hane
>
> I have deployed nutch in a standard configuration without any modifications.
> On all of the pages that it is crawling on my website, during the parse phase
> it converts an HTML entity into Â.
> The charset is set on the page to be:
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
> When I issue the command
> bin/nutch readseg -get demo.crawl/segments/20070718174552/
> http://demo.itsolut.com/mr.com/bookstore/maintenancemanagement/wiremanlibrary.htm
> The HTML portion contains:
> <tr>
> <td align="center">
> <b><font face="Arial" size="0">Address: 120 South
> 7th Street - Terre Haute, IN 47807</font></b>
> </td>
> </tr>
> and the parsed content is:
> Address: 120 South 7th Street - Terre Haute, IN 47807
> Also, the output contains the following:
> Parse Metadata: OriginalCharEncoding=windows-1252
> CharEncodingForConversion=windows-1252
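
One way a stray Â can appear is a UTF-8/windows-1252 mismatch around a
non-breaking space. The sketch below assumes the entity in question is
&nbsp; (U+00A0), which the report as quoted here does not state, and it
is not a confirmed diagnosis of what Nutch 0.9 does. The class name
NbspMojibake and the method misdecode are hypothetical, chosen only for
illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NbspMojibake {
    // Hypothetical helper: encode text as UTF-8, then decode those bytes
    // as windows-1252. This reproduces a charset mismatch; it is an
    // assumption about the failure mode, not Nutch's actual code path.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+00A0 is the character that &nbsp; resolves to (assumed entity).
        String original = "Address:\u00A0120 South";
        String garbled = misdecode(original);

        // U+00A0 is the bytes C2 A0 in UTF-8. Read as windows-1252,
        // C2 becomes Â (U+00C2) and A0 stays a non-breaking space.
        System.out.println(garbled);
        System.out.println(garbled.contains("\u00C2"));
    }
}
```

If this is the cause, the parsed text would show Â immediately before
each non-breaking space, even though the declared and detected charsets
are both windows-1252.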