One more thing: I'm running on Linux 2.6.21 with an en_US ISO-8859-1
charset for the OS. Could that make a difference?
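A small sketch of why the locale might matter, under the assumption (not confirmed anywhere in this thread) that the parsed text is stored as UTF-8 and only displayed through an ISO-8859-1 terminal. The no-break space (U+00A0) takes two bytes in UTF-8, and reading those two bytes as ISO-8859-1 shows an "Â" followed by a no-break space:

```python
# U+00A0 is the no-break space character.
nbsp = "\u00a0"

# Encoded as UTF-8 it becomes two bytes: 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# Assumption: a viewer using ISO-8859-1 interprets each byte as one character,
# so 0xC2 shows up as "Â" and 0xA0 as a no-break space.
shown_as_latin1 = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes.hex())
print(repr(shown_as_latin1))
```

If that is what is happening here, the stored data may be fine and only the display is off, but that is a guess rather than something verified against Nutch.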
Thanks for any help.
Chris....
Chris Hane wrote:
I have set up Nutch 0.9 and things are working correctly, except that the
&nbsp; sequence is being converted to: Â
The character encoding in the html pages is windows-1252.
A sample snippet that is converted is:
======= start ============
<td align="center">
<b><font face="Arial" size="0">Address: 120 South
7th Street - Terre Haute, IN 47807</font></b>
</td>
======== end ==============
When I look at the parsed text (using bin/nutch readseg...) it looks like:
======== start ===========
120 South 7th Street - Terre Haute, IN 47807
======== end =============
Is there a way to get the &nbsp; to either be ignored or translated
correctly to the space character?
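To make the question concrete, here is a minimal sketch of the translation I have in mind. It is plain Python, not a Nutch API: `normalize_nbsp` is a hypothetical helper, and the sample bytes (including where the 0xA0 byte sits) are made up for illustration rather than taken from the real page.

```python
def normalize_nbsp(text: str) -> str:
    """Replace every no-break space (U+00A0) with an ordinary space."""
    return text.replace("\u00a0", " ")

# Hypothetical windows-1252 input; in that encoding byte 0xA0 is a no-break space.
raw = b"Address:\xa0120 South"

# Decode with the charset the page declares, then normalize.
decoded = raw.decode("windows-1252")
cleaned = normalize_nbsp(decoded)

print(cleaned)
```

Something equivalent applied to the text after parsing would give the result I am after.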
Thanks,
Chris....