[ 
https://issues.apache.org/jira/browse/NUTCH-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-519.
-------------------------------

    Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> &nbsp; parsed incorrectly
> -------------------------
>
>                 Key: NUTCH-519
>                 URL: https://issues.apache.org/jira/browse/NUTCH-519
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>         Environment: Linux 2.6.21
> Java 1.5
> Nutch 0.9
>            Reporter: Chris Hane
>
> I have deployed nutch in a standard configuration without any modifications.
> On all of the pages that it is crawling on my website, during the parse phase 
> it converts the &nbsp; HTML entity into Â.
> The charset is set on the page to be: 
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
> When I issue the command
> bin/nutch readseg -get demo.crawl/segments/20070718174552/ 
> http://demo.itsolut.com/mr.com/bookstore/maintenancemanagement/wiremanlibrary.htm
> The HTML portion contains:
>     <tr>
>       <td align="center">
>       <b><font face="Arial" size="0">Address: 120 South
> 7th Street&nbsp; -&nbsp; Terre Haute, IN 47807</font></b>
>       </td>
>     </tr>
> and the parsed content is:
>  Address: 120 South 7th Street  -  Terre Haute, IN 47807
> Also, the output contains the following:
> Parse Metadata: OriginalCharEncoding=windows-1252 
> CharEncodingForConversion=windows-1252
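
The symptom reported above (&nbsp; turning into Â) matches a common
encoding mismatch: the no-break space U+00A0 is the two bytes 0xC2 0xA0 in
UTF-8, and reading those bytes as windows-1252 yields "Â" followed by a
no-break space. Whether this is what actually happened inside Nutch 0.9 is an
assumption; the issue was closed without a diagnosis. The sketch below only
reproduces the character-level effect in Python, and the function name
simulate_mojibake is hypothetical.

```python
import html


def simulate_mojibake(fragment: str) -> str:
    """Reproduce the '&nbsp;' -> 'Â' effect under an ASSUMED cause.

    Assumption (not confirmed by the issue): the text is encoded as UTF-8
    at some stage and later decoded as windows-1252.
    """
    # Resolve HTML entities; '&nbsp;' becomes the single character U+00A0.
    text = html.unescape(fragment)
    # U+00A0 is encoded as the two bytes 0xC2 0xA0 in UTF-8.
    raw = text.encode("utf-8")
    # In windows-1252, 0xC2 is 'Â' and 0xA0 is the no-break space.
    return raw.decode("windows-1252")


if __name__ == "__main__":
    sample = "Street&nbsp; -&nbsp; Terre Haute"
    # repr() makes the otherwise invisible no-break spaces visible.
    print(repr(simulate_mojibake(sample)))
```

If this is the cause, decoding with the same charset that was used for
encoding would keep the no-break space as a single character.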

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
