[ https://issues.apache.org/jira/browse/NUTCH-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1582:
----------------------------------------
Description:
When I do a crawl of these pages with microformats-reltag activated, I get
loads of garbage included within my dump of the webdb.
http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA
{code:xml}
metadata Rel-Tag :
�^A^@^B^@^@^@^Pget_range_slices^@^@^@^B^O^@^@^L^@^@^A^W^K^@^A^@^@^@(com.amazon.www:http/review/RZJZBDJMTYN4Y^O^@^B^L^@^@^@^B^L^@^B^K^@^A^@^@^@^Bil^O^@^B^L^@^@^@^B^K^@^A^@^@^@Jhttp://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA^K^@^B^@^@^@(Horrible
Device, Two Years of Experience
^@^C^@^D�]����^@^K^@^A^@^@^@Qhttp://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y^K^@^B^@^@^@(Horrible
Device, Two Years of Experience
^@^C^@^D�]����^@^@^@^L^@^B^K^@^A^@^@^@^Bmk^O^@^B^L^@^@^@^A^K^@^A^@^@^@^Ddist^K^@^B^@^@^@^A1
{code}
was:
When I do a crawl of these pages with microformats-reltag activated, I get
loads of garbage included within my dump of the webdb.
http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA
{code:xml}
metadata Rel-Tag :
�^A^@^B^@^@^@^Pget_range_slices^@^@^@^B^O^@^@^L^@^@^A^W^K^@^A^@^@^@(com.amazon.www:http/review/RZJZBDJMTYN4Y^O^@^B^L^@^@^@^B^L^@^B^K^@^A^@^@^@^Bil^O^@^B^L^@^@^@^B^K^@^A^@^@^@Jhttp://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA^K^@^B^@^@^@(Horrible
Device, Two Years of Experience
^@^C^@^D�]����^@^K^@^A^@^@^@Qhttp://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y^K^@^B^@^@^@(Horrible
Device, Two Years of Experience
^@^C^@^D�]����^@^@^@^L^@^B^K^@^A^@^@^@^Bmk^O^@^B^L^@^@^@^A^K^@^A^@^@^@^Ddist^K^@^B^@^@^@^A1
{code:xml}
> Garbage when microformats-reltag invoked in 2.x
> -----------------------------------------------
>
> Key: NUTCH-1582
> URL: https://issues.apache.org/jira/browse/NUTCH-1582
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.2
> Environment: Nutch 2.x HEAD
> gora-core & gora-cassandra 0.3
> Reporter: Lewis John McGibbney
> Fix For: 2.3
>
>
> When I do a crawl of these pages with microformats-reltag activated, I get
> loads of garbage included within my dump of the webdb.
> http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
> http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA
> {code:xml}
> metadata Rel-Tag :
> �^A^@^B^@^@^@^Pget_range_slices^@^@^@^B^O^@^@^L^@^@^A^W^K^@^A^@^@^@(com.amazon.www:http/review/RZJZBDJMTYN4Y^O^@^B^L^@^@^@^B^L^@^B^K^@^A^@^@^@^Bil^O^@^B^L^@^@^@^B^K^@^A^@^@^@Jhttp://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA^K^@^B^@^@^@(Horrible
> Device, Two Years of Experience
> ^@^C^@^D�]����^@^K^@^A^@^@^@Qhttp://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y^K^@^B^@^@^@(Horrible
> Device, Two Years of Experience
> ^@^C^@^D�]����^@^@^@^L^@^B^K^@^A^@^@^@^Bmk^O^@^B^L^@^@^@^A^K^@^A^@^@^@^Ddist^K^@^B^@^@^@^A1
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira