I have set up Nutch 0.9 and things are working correctly, except that the
&nbsp; sequence is being converted to: Â
The character encoding in the HTML pages is windows-1252.
A sample snippet that is converted is:
======= start ============
<td align="center">
<b><font face="Arial" size="0">Address: 120 South
7th Street - Terre Haute, IN 47807</font></b>
</td>
======== end ==============
When I look at the parsed text (using bin/nutch readseg...) it looks like:
======== start ===========
120 South 7th Street - Terre Haute, IN 47807
======== end =============
Is there a way to get the &nbsp; to either be ignored or translated
correctly to the space character?
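
For what it is worth, here is a small sketch of what I suspect is going on. It is a guess, not something I have verified against Nutch 0.9. The &nbsp; entity stands for U+00A0, which is the two bytes C2 A0 in UTF-8. If those bytes are later read back as windows-1252, the C2 byte shows up as Â. The class name, the method names and the sample string below are hypothetical and are not taken from Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NbspDemo {
    // U+00A0 is the character that the HTML entity &nbsp; stands for.
    static final char NBSP = '\u00A0';

    // Reproduces the symptom under the assumption stated above:
    // encode the text as UTF-8, then decode the bytes as windows-1252.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // One possible workaround: map every no-break space to a plain space.
    static String normalizeSpaces(String text) {
        return text.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        // Hypothetical input; where the entities sit in the real page is assumed.
        String parsed = "Address:" + NBSP + "120" + NBSP + "South";
        System.out.println(misdecode(parsed));       // shows a stray Â before each no-break space
        System.out.println(normalizeSpaces(parsed)); // Address: 120 South
    }
}
```

If that guess is right, the Â may come from whatever displays the readseg output rather than from the parse itself, and replacing U+00A0 with a plain space in the parsed text would be one way to sidestep it.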
Thanks,
Chris....
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general