I have set up Nutch 0.9 and things are working correctly, except that the
&nbsp; sequence is being converted to: Â
The character encoding in the html pages is windows-1252.
A sample snippet that is converted is:
======= start ============
<td align="center">
<b><font face="Arial" size="0">Address: 120 South
7th Street - Terre Haute, IN 47807</font></b>
</td>
======== end ==============
When I look at the parsed text (using bin/nutch readseg...) it looks like:
======== start ===========
120 South 7th Street - Terre Haute, IN 47807
======== end =============
Is there a way to get the &nbsp; to either be ignored or translated
correctly to the space character?
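My guess, and it is only a guess, is that the no-break space (U+00A0) is being written out as UTF-8 (bytes C2 A0) and then read back as windows-1252, which would show the C2 byte as Â. Here is a minimal sketch of that suspected mechanism and of the replacement I am after. The class name, the method names, and the placement of the no-break space in the sample string are my own assumptions; none of this is Nutch code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NbspDemo {
    // Encode the text as UTF-8, then decode those bytes as windows-1252.
    // This simulates the suspected charset mismatch (an assumption, not confirmed).
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // Replace every no-break space (U+00A0) with an ordinary space.
    static String normalizeNbsp(String s) {
        return s.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // Hypothetical sample: where the no-break space sits is assumed.
        String parsed = "Address:\u00A0120 South";
        System.out.println(misdecode(parsed));     // shows the stray character
        System.out.println(normalizeNbsp(parsed)); // what I would like to get
    }
}
```

If that is what is happening, then something like normalizeNbsp applied to the parsed text would be enough for my case, but I do not know where in Nutch such a step would belong.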
Thanks,
Chris....