parsed incorrectly
-------------------------

                 Key: NUTCH-519
                 URL: https://issues.apache.org/jira/browse/NUTCH-519
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
         Environment: Linux 2.6.21
                      Java 1.5
                      Nutch 0.9
            Reporter: Chris Hane


I have deployed Nutch in a standard configuration without any modifications.

On every page that it crawls on my website, the parse phase converts the
&nbsp; HTML entity into Â.

The charset is set on the page to be: 
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

When I issue the command

bin/nutch readseg -get demo.crawl/segments/20070718174552/ \
    http://demo.itsolut.com/mr.com/bookstore/maintenancemanagement/wiremanlibrary.htm

The HTML portion contains:
    <tr>
      <td align="center">
      <b><font face="Arial" size="0">Address: 120 South
7th Street&nbsp; -&nbsp; Terre Haute, IN 47807</font></b>
      </td>
    </tr>

and the parsed content is:
 Address: 120 South 7th Street  -  Terre Haute, IN 47807

Also, the output contains the following:
Parse Metadata: OriginalCharEncoding=windows-1252 CharEncodingForConversion=windows-1252
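
One possible explanation, offered only as an assumption and not confirmed
anywhere in this report: the &nbsp; entity is resolved to the character
U+00A0, that character is written out as UTF-8 (bytes C2 A0), and the bytes
are then displayed as windows-1252, where C2 renders as Â. The sketch below
is plain Python, not Nutch code; the function name mojibake and the sample
string are hypothetical and exist only to reproduce the symptom.

```python
# U+00A0 (no-break space) is the character that the &nbsp; entity stands for.
NBSP = "\u00a0"


def mojibake(text: str, written_as: str = "utf-8",
             read_as: str = "windows-1252") -> str:
    """Encode text with one charset, then decode the bytes with another.

    Hypothetical helper: it imitates a writer and a reader that disagree
    about the character encoding. It does not call any Nutch API.
    """
    return text.encode(written_as).decode(read_as)


# Sample modelled on the address line from the page in this report.
sample = "7th Street" + NBSP + " -" + NBSP + " Terre Haute, IN 47807"
garbled = mojibake(sample)

# repr() makes the otherwise invisible U+00A0 visible as '\xa0'.
print(repr(garbled))
```

If this assumption holds, each &nbsp; shows up as Â followed by a no-break
space, and plain ASCII text is unaffected, which would match the symptom
described above.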

