Any suggestions? I finally modified
org.apache.nutch.parse.html.HtmlParser to remove the &nbsp; entities from
the input stream before passing it to the NekoHTML or TagSoup parsers (both
have this issue).
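For anyone who wants to try the same workaround, here is a rough sketch of the idea rather than the actual patch. The class and method names (NbspPreFilter, replaceNbsp) are hypothetical and not part of Nutch, and the position of the &nbsp; in the sample string is assumed. It decodes the raw bytes with the page's charset, swaps the literal &nbsp; entity for a plain space, and re-encodes the result so it can be handed to the parser:

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Nutch: pre-filters raw page content
// before it is handed to an HTML parser.
final class NbspPreFilter {

    // Decode the bytes with the page's charset, replace every literal
    // "&nbsp;" entity with an ordinary space, and re-encode.
    static byte[] replaceNbsp(byte[] content, Charset charset) {
        String html = new String(content, charset);
        String cleaned = html.replace("&nbsp;", " ");
        return cleaned.getBytes(charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // Assumed sample: where the entity sits in the real page is a guess.
        byte[] raw = "<b>Address:&nbsp;120 South 7th Street</b>".getBytes(cp1252);
        System.out.println(new String(replaceNbsp(raw, cp1252), cp1252));
    }
}
```

This only handles the named entity; a numeric reference such as &#160; or a raw 0xA0 byte would need the same treatment.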
I also opened a JIRA so that this issue isn't lost:
https://issues.apache.org/jira/browse/NUTCH-519
Chris....
Chris Hane wrote:
I have set up Nutch 0.9 and things are working correctly, except that the
&nbsp; sequence is being converted to: Â
The character encoding in the HTML pages is windows-1252.
A sample snippet that is converted is:
======= start ============
<td align="center">
<b><font face="Arial" size="0">Address: 120 South
7th Street - Terre Haute, IN 47807</font></b>
</td>
======== end ==============
When I look at the parsed text (using bin/nutch readseg...) it looks like:
======== start ===========
120 South 7th Street - Terre Haute, IN 47807
======== end =============
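One common way a Â ends up in front of a non-breaking space is that the character U+00A0 gets written out as UTF-8 (bytes C2 A0) and is then read back as windows-1252, where C2 is Â. Whether that is what actually happens inside Nutch 0.9 is an assumption; the sketch below only demonstrates the effect, and the class name NbspMojibake is made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how a non-breaking space can pick up
// a stray "Â" when UTF-8 bytes are decoded as windows-1252.
final class NbspMojibake {

    // Encode the text as UTF-8, then (wrongly) decode it as windows-1252.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+00A0 is the character that &nbsp; stands for.
        String shown = misdecode("\u00A0");
        for (char c : shown.toCharArray()) {
            // Prints U+00C2 (Â) followed by U+00A0 (non-breaking space).
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```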
Is there a way to get the &nbsp; to either be ignored or translated
correctly to the space character?
Thanks,
Chris....