Parse-tika throws some URLs away
--------------------------------
Key: NUTCH-984
URL: https://issues.apache.org/jira/browse/NUTCH-984
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Priority: Critical
Fix For: 1.3, 2.0
For some reason, when using parse-tika, a crawl just wouldn't descend into a
website's news archive. Paging through the archive is done with simple anchors:
<div class="page active">1</div>
<a href="/nieuws/overzicht/1/"><div class="page">2</div> </a>
<a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>
I added some logging to DOMContentUtils:
2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/1/
2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/2/
2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/3/
...
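To rule out the markup itself, here is a minimal standalone sketch that collects
the href of every anchor in the paging fragment above with the JDK's own DOM
parser. It is not Nutch code and does not reproduce shouldThrowAwayLink; the
class name PagingLinks and the method extractHrefs are hypothetical, and the
fragment is wrapped in a made-up <root> element so it parses as XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch or Tika.
class PagingLinks {

    // Returns the href attribute of every <a> element, in document order.
    static List<String> extractHrefs(String fragment) {
        try {
            // Wrap the fragment in a single root so it is well-formed XML.
            String xml = "<root>" + fragment + "</root>";
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList anchors = doc.getElementsByTagName("a");
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element anchor = (Element) anchors.item(i);
                hrefs.add(anchor.getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        String fragment =
            "<div class=\"page active\">1</div>"
            + " <a href=\"/nieuws/overzicht/1/\"><div class=\"page\">2</div> </a>"
            + " <a href=\"/nieuws/overzicht/2/\"><div class=\"page\">3</div> </a>";
        for (String href : extractHrefs(fragment)) {
            System.out.println(href);
        }
    }
}
```

A plain DOM walk like this reaches both anchors, which suggests the links are
dropped by the filtering step rather than being absent from the parsed tree.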
Now, this is rather odd: the code for private boolean
shouldThrowAwayLink(Node node, NodeList children, int childLen, LinkParams
params) is the same in parse-html and parse-tika. I also compared the number of
URLs the two parsers produce in versions 1.2 and 1.3 for the following URL:
http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm
1.2 - parse-tika: 196
1.2 - parse-html: 296
1.3 - parse-tika: 279
1.3 - parse-html: 296
Something clearly improved in 1.3, but the URLs that parse-tika still fails to
generate are a blocker in my case.