[
https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021113#comment-13021113
]
Markus Jelsma edited comment on NUTCH-984 at 4/26/11 4:02 PM:
--------------------------------------------------------------
Yes, I can test these URLs with tika-parsers 0.9, but what do you want to see?
They seem to be parsed correctly when using the -t option but not when using -h
or -x. With those options the anchors become
<a shape="rect"
href="http://www.site.nl/nieuws/het-laatste-nieuws/overzicht/2/"/>3
So in this case the anchor indeed doesn't contain data and is thus thrown away.
Might be a Tika issue instead!
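
To illustrate why such an anchor ends up without data, here is a minimal sketch
using only the JDK's DOM parser (not Tika or Nutch). The class name
EmptyAnchorDemo and the helper hasNoData are hypothetical, and hasNoData assumes
a simplified rule in which an anchor with no child nodes is discarded; the real
check in DOMContentUtils may do more.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EmptyAnchorDemo {

    /** Wraps the fragment in a dummy root, parses it as XML and returns the first <a> element. */
    static Element firstAnchor(String fragment) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
            return (Element) doc.getElementsByTagName("a").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("fragment is not well-formed XML", e);
        }
    }

    /** Assumed, simplified rule: an anchor without any child nodes carries no data. */
    static boolean hasNoData(Element anchor) {
        return anchor.getChildNodes().getLength() == 0;
    }

    public static void main(String[] args) {
        // Anchor as it appears in the -h / -x output: self-closed, the text follows it as a sibling.
        Element selfClosed = firstAnchor(
                "<a shape=\"rect\" href=\"http://www.site.nl/nieuws/het-laatste-nieuws/overzicht/2/\"/>3");
        // Anchor as written in the page source: the <div> is a child of the <a>.
        Element wrapping = firstAnchor(
                "<a href=\"/nieuws/overzicht/2/\"><div class=\"page\">3</div> </a>");

        System.out.println("self-closed anchor has no data: " + hasNoData(selfClosed));
        Node sibling = selfClosed.getNextSibling();
        System.out.println("node after self-closed anchor: " + sibling.getNodeValue());
        System.out.println("wrapping anchor has no data: " + hasNoData(wrapping));
    }
}
```

In the self-closed form the "3" is a sibling of the anchor rather than a child,
so any check that looks only at the anchor's children sees nothing.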
was (Author: markus17):
Yes i can test these URL's with tika-parsers 0.9 but what do you want to
see? They seem to be parsed correctly when using the -t option but not when
using -h or -x. The anchors become
<a shape="rect"
href="http://www.arriva.nl/nieuws/het-laatste-nieuws/overzicht/2/"/>3
So in this case the anchor indeed doesn't contain data and is thus thrown away.
Might be a Tika issue instead!
> Parse-tika throws some URL's away
> ---------------------------------
>
> Key: NUTCH-984
> URL: https://issues.apache.org/jira/browse/NUTCH-984
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3, 2.0
> Reporter: Markus Jelsma
> Priority: Critical
> Fix For: 1.3, 2.0
>
>
> For some reason, when using parse-tika, a crawl just wouldn't dive into a
> website's news archive. Paging through the news archive is done with simple
> anchors:
> <div class="page active">1</div>
> <a href="/nieuws/overzicht/1/"><div class="page">2</div> </a>
> <a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>
> I added some logging to DOMContentUtils:
> 2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/1/
> 2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/2/
> 2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/3/
> ...
> Now, this is rather funky. The code for private boolean
> shouldThrowAwayLink(Node node, NodeList children, int childLen, LinkParams
> params) is the same for parse-html and parse-tika. I also tested the two
> parsers in versions 1.2 and 1.3 against the following URL:
> http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm
> 1.2 - parse-tika: 196
> 1.2 - parse-html: 296
> 1.3 - parse-tika: 279
> 1.3 - parse-html: 296
> Something clearly improved in 1.3, but not generating the remaining URLs is
> a blocker for parse-tika in my case. Relevant configurations are the same;
> parser.html.outlinks.ignore_tags is not being used. Testing has been done
> with ParserChecker only.
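
To see which URLs account for the difference between two parsers, one option is
to diff their outlink lists. The following is a minimal sketch in plain Java; the
class OutlinkDiff, the method missing and the two sample lists are hypothetical,
and it assumes the outlinks have already been collected by hand, for example
from two ParserChecker runs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class OutlinkDiff {

    /** Returns the links present in reference but absent from candidate, in reference order. */
    static Set<String> missing(List<String> reference, List<String> candidate) {
        Set<String> result = new LinkedHashSet<>(reference);
        result.removeAll(new HashSet<>(candidate));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: outlinks as reported for the same page by each parser.
        List<String> fromParseHtml = Arrays.asList(
                "http://www.site.nl/nieuws/overzicht/1/",
                "http://www.site.nl/nieuws/overzicht/2/",
                "http://www.site.nl/nieuws/overzicht/3/");
        List<String> fromParseTika = Arrays.asList(
                "http://www.site.nl/nieuws/overzicht/3/");

        // Print every link that only the reference parser produced.
        for (String url : missing(fromParseHtml, fromParseTika)) {
            System.out.println("missing from parse-tika: " + url);
        }
    }
}
```

Looking at the markup around the missing links in the page source should show
whether they share a pattern, such as anchors wrapping block elements.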
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira