Lewis John McGibbney created NUTCH-1582:
-------------------------------------------

             Summary: Garbage when microformats-reltag invoked in 2.x
                 Key: NUTCH-1582
                 URL: https://issues.apache.org/jira/browse/NUTCH-1582
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.2
         Environment: Nutch 2.x HEAD
gora-core & gora-cassandra 0.3
            Reporter: Lewis John McGibbney
             Fix For: 2.3


When I crawl the two pages below with microformats-reltag activated, the
Rel-Tag metadata in my dump of the webdb contains loads of binary garbage.

http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA

{code:xml}
metadata Rel-Tag :      
�^A^@^B^@^@^@^Pget_range_slices^@^@^@^B^O^@^@^L^@^@^A^W^K^@^A^@^@^@(com.amazon.www:http/review/RZJZBDJMTYN4Y^O^@^B^L^@^@^@^B^L^@^B^K^@^A^@^@^@^Bil^O^@^B^L^@^@^@^B^K^@^A^@^@^@Jhttp://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA^K^@^B^@^@^@(Horrible
 Device, Two Years of Experience
^@^C^@^D�]����^@^K^@^A^@^@^@Qhttp://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y^K^@^B^@^@^@(Horrible
 Device, Two Years of Experience
^@^C^@^D�]����^@^@^@^L^@^B^K^@^A^@^@^@^Bmk^O^@^B^L^@^@^@^A^K^@^A^@^@^@^Ddist^K^@^B^@^@^@^A1
{code}
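
For anyone trying to reproduce this, plugins are normally activated through the plugin.includes property in conf/nutch-site.xml. The fragment below is only a sketch: the exact value used for this report is not stated, so every plugin name other than microformats-reltag is an assumed, illustrative list.

{code:xml}
<!-- conf/nutch-site.xml (sketch; the surrounding plugin list is an assumption) -->
<property>
  <name>plugin.includes</name>
  <!-- microformats-reltag is appended to an illustrative plugin list -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|microformats-reltag</value>
</property>
{code}

Comparing the dump with and without microformats-reltag in this value should show whether the garbage depends on the plugin.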


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
