[ https://issues.apache.org/jira/browse/NUTCH-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1582:
----------------------------------------

    Description: 
When I do a crawl of these pages with microformats-reltag activated, I get 
loads of garbage included within my dump of the webdb.

http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA

{code:xml}
metadata Rel-Tag :      
�^A^@^B^@^@^@^Pget_range_slices^@^@^@^B^O^@^@^L^@^@^A^W^K^@^A^@^@^@(com.amazon.www:http/review/RZJZBDJMTYN4Y^O^@^B^L^@^@^@^B^L^@^B^K^@^A^@^@^@^Bil^O^@^B^L^@^@^@^B^K^@^A^@^@^@Jhttp://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA^K^@^B^@^@^@(Horrible
 Device, Two Years of Experience
^@^C^@^D�]����^@^K^@^A^@^@^@Qhttp://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y^K^@^B^@^@^@(Horrible
 Device, Two Years of Experience
^@^C^@^D�]����^@^@^@^L^@^B^K^@^A^@^@^@^Bmk^O^@^B^L^@^@^@^A^K^@^A^@^@^@^Ddist^K^@^B^@^@^@^A1
{code}


  was:
When I do a crawl of these pages with microformats-reltag activated, I get 
loads of garbage included within my dump of the webdb.

http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA

{code:xml}
metadata Rel-Tag :      
�^A^@^B^@^@^@^Pget_range_slices^@^@^@^B^O^@^@^L^@^@^A^W^K^@^A^@^@^@(com.amazon.www:http/review/RZJZBDJMTYN4Y^O^@^B^L^@^@^@^B^L^@^B^K^@^A^@^@^@^Bil^O^@^B^L^@^@^@^B^K^@^A^@^@^@Jhttp://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA^K^@^B^@^@^@(Horrible
 Device, Two Years of Experience
^@^C^@^D�]����^@^K^@^A^@^@^@Qhttp://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y^K^@^B^@^@^@(Horrible
 Device, Two Years of Experience
^@^C^@^D�]����^@^@^@^L^@^B^K^@^A^@^@^@^Bmk^O^@^B^L^@^@^@^A^K^@^A^@^@^@^Ddist^K^@^B^@^@^@^A1
{code:xml}


    
> Garbage when microformats-reltag invoked in 2.x
> -----------------------------------------------
>
>                 Key: NUTCH-1582
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1582
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.2
>         Environment: Nutch 2.x HEAD
> gora-core & gora-cassandra 0.3
>            Reporter: Lewis John McGibbney
>             Fix For: 2.3
>
>
> When I do a crawl of these pages with microformats-reltag activated, I get 
> loads of garbage included within my dump of the webdb.
> http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
> http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA
> {code:xml}
> metadata Rel-Tag :      
> �^A^@^B^@^@^@^Pget_range_slices^@^@^@^B^O^@^@^L^@^@^A^W^K^@^A^@^@^@(com.amazon.www:http/review/RZJZBDJMTYN4Y^O^@^B^L^@^@^@^B^L^@^B^K^@^A^@^@^@^Bil^O^@^B^L^@^@^@^B^K^@^A^@^@^@Jhttp://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA^K^@^B^@^@^@(Horrible
>  Device, Two Years of Experience
> ^@^C^@^D�]����^@^K^@^A^@^@^@Qhttp://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y^K^@^B^@^@^@(Horrible
>  Device, Two Years of Experience
> ^@^C^@^D�]����^@^@^@^L^@^B^K^@^A^@^@^@^Bmk^O^@^B^L^@^@^@^A^K^@^A^@^@^@^Ddist^K^@^B^@^@^@^A1
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
