parsed incorrectly
-------------------------
Key: NUTCH-519
URL: https://issues.apache.org/jira/browse/NUTCH-519
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Environment: Linux 2.6.21
Java 1.5
Nutch 0.9
Reporter: Chris Hane
I have deployed Nutch in a standard configuration without any modifications.
On all of the pages that it crawls on my website, the parse phase
converts an HTML entity into Â.
The charset is set on the page to be:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
When I issue the command
bin/nutch readseg -get demo.crawl/segments/20070718174552/ http://demo.itsolut.com/mr.com/bookstore/maintenancemanagement/wiremanlibrary.htm
The HTML portion contains:
<tr>
<td align="center">
<b><font face="Arial" size="0">Address: 120 South
7th Street - Terre Haute, IN 47807</font></b>
</td>
</tr>
and the parsed content is:
Address: 120 South 7th Street - Terre Haute, IN 47807
Also, the output contains the following:
Parse Metadata: OriginalCharEncoding=windows-1252
CharEncodingForConversion=windows-1252
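For anyone trying to reproduce the symptom outside of Nutch, here is a minimal sketch of one way a stray Â can appear. It rests on an assumption that the report does not confirm: that the affected entity is a non-breaking space (U+00A0) which is at some point written out as UTF-8 and then read back as windows-1252. The class and method names are hypothetical and are not part of Nutch. It uses StandardCharsets, so it needs Java 7 or later rather than the Java 1.5 listed above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Nutch code.
class NbspMojibake {

    // Encode the text as UTF-8, then decode those bytes with another charset.
    static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // Assumption: the entity in the page is a non-breaking space (U+00A0).
        String original = "Address:\u00A0120 South";

        // U+00A0 is the two bytes C2 A0 in UTF-8. Read as windows-1252,
        // byte C2 becomes the letter Â and byte A0 stays a non-breaking space.
        String garbled = misdecode(original, Charset.forName("windows-1252"));

        System.out.println("original length: " + original.length());
        System.out.println("garbled length:  " + garbled.length());
        System.out.println("garbled text:    " + garbled);
    }
}
```

If the parsed text from readseg shows the same extra character in front of the space, that would point at a UTF-8 / windows-1252 mismatch somewhere between parsing and output, though this sketch alone does not show where in Nutch it happens.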