I have set up Nutch 0.9 and things are working correctly, except that the &nbsp; sequence is being converted to: Â

The character encoding of the HTML pages is windows-1252.

A sample snippet that is converted is:

======= start ============
<td align="center">

      <b><font face="Arial" size="0">Address: 120 South
7th Street&nbsp; -&nbsp; Terre Haute, IN 47807</font></b>
      </td>
======== end ==============

When I look at the parsed text (using bin/nutch readseg...) it looks like:

======== start ===========
120 South 7th Street  -  Terre Haute, IN 47807
======== end =============

Is there a way to get the &nbsp; to either be ignored or translated correctly to the space character?
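
My guess at the cause, as a minimal sketch only: &nbsp; is the character U+00A0, which is the two bytes C2 A0 in UTF-8, and if those bytes are later read back as windows-1252 the C2 shows up as Â. The class and method names below (NbspDemo, misdecode, normalizeNbsp) are made up for illustration and are not part of Nutch; it also uses StandardCharsets, so it assumes a newer JDK than Nutch 0.9 itself needs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Nutch code.
public class NbspDemo {

    // Replace every non-breaking space (U+00A0) with an ordinary space.
    static String normalizeNbsp(String text) {
        return text.replace('\u00A0', ' ');
    }

    // Reproduce the suspected problem: encode as UTF-8,
    // then (wrongly) decode the bytes as windows-1252.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // The parsed text with &nbsp; already resolved to U+00A0.
        String parsed = "Street\u00A0 -\u00A0 Terre Haute";
        System.out.println(misdecode(parsed));     // shows the stray Â
        System.out.println(normalizeNbsp(parsed)); // plain spaces only
    }
}
```

If that is what is going on, then either viewing the readseg output as UTF-8 or running something like normalizeNbsp over the parsed text would avoid the Â, but I do not know where in Nutch that would belong.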

Thanks,
Chris....
