Lewis John McGibbney created NUTCH-1582:
-------------------------------------------
Summary: Garbage when microformats-reltag invoked in 2.x
Key: NUTCH-1582
URL: https://issues.apache.org/jira/browse/NUTCH-1582
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 2.2
Environment: Nutch 2.x HEAD
gora-core & gora-cassandra 0.3
Reporter: Lewis John McGibbney
Fix For: 2.3
When I crawl the following pages with the microformats-reltag plugin activated,
the Rel-Tag metadata in my dump of the webdb contains binary garbage.
http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA
{code:xml}
metadata Rel-Tag :
�^A^@^B^@^@^@^Pget_range_slices^@^@^@^B^O^@^@^L^@^@^A^W^K^@^A^@^@^@(com.amazon.www:http/review/RZJZBDJMTYN4Y^O^@^B^L^@^@^@^B^L^@^B^K^@^A^@^@^@^Bil^O^@^B^L^@^@^@^B^K^@^A^@^@^@Jhttp://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA^K^@^B^@^@^@(Horrible
Device, Two Years of Experience
^@^C^@^D�]����^@^K^@^A^@^@^@Qhttp://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y^K^@^B^@^@^@(Horrible
Device, Two Years of Experience
^@^C^@^D�]����^@^@^@^L^@^B^K^@^A^@^@^@^Bmk^O^@^B^L^@^@^@^A^K^@^A^@^@^@^Ddist^K^@^B^@^@^@^A1
{code}
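The dump above contains the string get_range_slices, which is the name of a Cassandra Thrift call, so the Rel-Tag value looks like raw serialized bytes rather than tag text. As a rough way to flag such values while triaging, here is a minimal sketch. The class name RelTagSanityCheck and the method looksBinary are hypothetical and not part of Nutch or Gora; the check assumes that a valid Rel-Tag value is plain text without control characters.

```java
public class RelTagSanityCheck {

    /**
     * Hypothetical helper (not a Nutch API): returns true if the value
     * contains control characters other than tab, newline, or carriage
     * return, or the Unicode replacement character U+FFFD.
     */
    static boolean looksBinary(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            boolean allowedWhitespace = c == '\t' || c == '\n' || c == '\r';
            if (c == '\uFFFD' || (Character.isISOControl(c) && !allowedWhitespace)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative inputs only; neither string is taken from a real crawl.
        String clean = "http://example.org/tag/wireless";
        String leaked = "\u0001\u0000\u0002get_range_slices\u0000\u0000";
        System.out.println("clean looks binary:  " + looksBinary(clean));
        System.out.println("leaked looks binary: " + looksBinary(leaked));
    }
}
```

This only detects the symptom in a dumped value. It does not explain how the bytes end up in the Rel-Tag field.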