parsed incorrectly
-------------------------

                 Key: NUTCH-519
                 URL: https://issues.apache.org/jira/browse/NUTCH-519
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
         Environment: Linux 2.6.21
                      Java 1.5
                      Nutch 0.9
            Reporter: Chris Hane


I have deployed nutch in a standard configuration without any modifications.

On all of the pages that it crawls on my website, during the parse phase 
it converts the &nbsp; HTML entity into Â.

The charset is set on the page to be: 
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

When I issue the command:

bin/nutch readseg -get demo.crawl/segments/20070718174552/ http://demo.itsolut.com/mr.com/bookstore/maintenancemanagement/wiremanlibrary.htm

The HTML portion contains:
    <tr>
      <td align="center">
      <b><font face="Arial" size="0">Address: 120 South
7th Street&nbsp; -&nbsp; Terre Haute, IN 47807</font></b>
      </td>
    </tr>

and the parsed content is:
 Address: 120 South 7th Street  -  Terre Haute, IN 47807

Also, the output contains the following:
Parse Metadata: OriginalCharEncoding=windows-1252 CharEncodingForConversion=windows-1252
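
For reference, one possible explanation of the Â symptom is a charset mismatch. This is an assumption; nothing in the output above confirms it. The &nbsp; entity stands for the character U+00A0. If that character is written out as UTF-8 and the resulting bytes are then read as windows-1252, an Â appears in front of each non-breaking space. The minimal sketch below reproduces that effect outside of Nutch. The helper name misdecode and the sample string are hypothetical and do not come from the Nutch code base.

```python
# Sketch only: reproduces the "Â" symptom outside of Nutch.
# Assumption: the text is written as UTF-8 and later read as windows-1252.

NBSP = "\u00a0"  # the character that the &nbsp; entity stands for


def misdecode(text: str, written_as: str = "utf-8",
              read_as: str = "windows-1252") -> str:
    """Encode text with one charset, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


# Hypothetical sample modelled on the address line in the report.
original = "7th Street" + NBSP + " -" + NBSP + " Terre Haute, IN 47807"
garbled = misdecode(original)

print(NBSP.encode("utf-8"))  # the two bytes 0xC2 0xA0
print(repr(garbled))         # each NBSP is now preceded by "Â"
```

If this is what happens, the fix would be to make the charset used when reading the text match the charset that was used to write it.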


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

