Re: nbsp converted to funky character

Chris Hane Wed, 18 Jul 2007 14:56:57 -0700

Any suggestions? I finally modified org.apache.nutch.parse.html.HtmlParser to remove the &nbsp; from the input stream before passing it to the NekoHTML or TagSoup parsers (both have this issue).
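
For anyone who wants to try the same workaround, here is a minimal sketch of the idea. It is not the actual change made to HtmlParser. The class NbspFilter and its methods are hypothetical names. It assumes the page bytes have been fully read and the page encoding (windows-1252 here) is already known, and it swaps each &nbsp; for a plain space rather than deleting it.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Nutch: cleans raw page bytes before
// they are handed to an HTML parser such as NekoHTML or TagSoup.
public class NbspFilter {

    // Replace the literal entity text "&nbsp;" with an ordinary space.
    public static String replaceNbspEntities(String html) {
        return html.replace("&nbsp;", " ");
    }

    // Decode the raw bytes with the page's charset, replace the entity,
    // and re-encode with the same charset so the parser sees clean input.
    public static byte[] filter(byte[] raw, Charset charset) {
        String html = new String(raw, charset);
        return replaceNbspEntities(html).getBytes(charset);
    }

    public static void main(String[] args) {
        // Assumption: the encoding is known up front, as in the report below.
        Charset cp1252 = Charset.forName("windows-1252");
        String sample = "7th Street&nbsp; -&nbsp; Terre Haute, IN 47807";
        byte[] cleaned = filter(sample.getBytes(cp1252), cp1252);
        System.out.println(new String(cleaned, cp1252));
    }
}
```

This only covers the named entity. Pages that use the numeric form or a raw non-breaking space byte would need extra handling.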


I also opened a JIRA so that this issue isn't lost:
https://issues.apache.org/jira/browse/NUTCH-519


Chris....

Chris Hane wrote:

I have set up Nutch 0.9 and things are working correctly, except that the &nbsp; sequence is being converted to: Â


The character encoding in the html pages is windows-1252.

A sample snippet that is converted is:

======= start ============
<td align="center">

      <b><font face="Arial" size="0">Address: 120 South
7th Street&nbsp; -&nbsp; Terre Haute, IN 47807</font></b>
      </td>
======== end ==============

When I look at the parsed text (using bin/nutch readseg...) it looks like:

======== start ===========
120 South 7th StreetÂ  -Â  Terre Haute, IN 47807
======== end =============
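
One possible explanation for the stray Â, not confirmed anywhere in this thread: the non-breaking space (U+00A0) is two bytes in UTF-8, 0xC2 0xA0, and if those bytes are later read as windows-1252 the first one shows up as Â. The sketch below only demonstrates that byte-level effect. The class name NbspMojibake is hypothetical, and nothing here is taken from the Nutch code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when UTF-8 bytes are
// decoded with the wrong single-byte charset.
public class NbspMojibake {

    // Encode text as UTF-8, then decode those bytes as windows-1252.
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00A0");
        // Print the code points so the invisible characters are visible.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

If this is what is happening, the output would contain Â followed by a non-breaking space, which fits the "Â" plus extra space seen in the parsed text above.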

Is there a way to get the &nbsp; to either be ignored or translated correctly to the space character?


Thanks,
Chris....
