[ 
https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021099#comment-13021099
 ] 

Julien Nioche commented on NUTCH-984:
-------------------------------------

Could you test the URLs above directly with Tika 0.9? I suppose this has to do
with the default mappers used by Tika, which we can override from Nutch.

BTW, this illustrates why parse-html is still the default option for HTML and
parse-tika is used for the other MIME types. I'd suggest that we mark this as
fixed in 2.0, as 1.3 is about to be RCed. More generally, the tests that are
used for checking the HTML parsing need to be ported to parse-tika as well.




> Parse-tika throws some URL's away
> ---------------------------------
>
>                 Key: NUTCH-984
>                 URL: https://issues.apache.org/jira/browse/NUTCH-984
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3, 2.0
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.3, 2.0
>
>
> For some reason, when using parse-tika, a crawl would not descend into a
> website's news archive. Paging through the news archive is done with simple
> anchors:
>
>   <div class="page active">1</div>
>   <a href="/nieuws/overzicht/1/"><div class="page">2</div> </a>
>   <a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>
>
> I added some logging to DOMContentUtils:
>
>   2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/1/
>   2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/2/
>   2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/3/
>   ...
>
> Now, this is rather odd: the code for
> private boolean shouldThrowAwayLink(Node node, NodeList children,
> int childLen, LinkParams params)
> is the same for parse-html and parse-tika. I also compared the two parsers
> in versions 1.2 and 1.3 on the following URL:
> http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm
>  1.2 - parse-tika: 196
>  1.2 - parse-html: 296
>  1.3 - parse-tika: 279
>  1.3 - parse-html: 296
> Something clearly improved in 1.3, but not generating the remaining URLs is
> a blocker for parse-tika in my case. The relevant configurations are the
> same, and parser.html.outlinks.ignore_tags is not being used. Testing has
> been done with ParserChecker only.
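
For reference, the paging markup quoted above can be inspected outside Nutch
and Tika with a small JDK-only sketch. It is not the Nutch or Tika code path:
it wraps the snippet in a hypothetical <root> element so that the standard XML
parser accepts it, then reports, for every anchor, the href and how many child
nodes and child elements it has. That is the kind of information a method with
the shouldThrowAwayLink(node, children, childLen, params) signature receives.
The class name AnchorShape and the method describeAnchors are made up for this
illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AnchorShape {
    // Paging markup from the report, wrapped in a hypothetical <root>
    // element so that it is well-formed XML (an assumption of this sketch).
    static final String SNIPPET =
        "<root><div class=\"page active\">1</div> "
        + "<a href=\"/nieuws/overzicht/1/\"><div class=\"page\">2</div> </a> "
        + "<a href=\"/nieuws/overzicht/2/\"><div class=\"page\">3</div> </a></root>";

    // Returns one line per <a> element: href, child node count, child element count.
    static List<String> describeAnchors(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            NodeList children = a.getChildNodes();
            int elements = 0;
            for (int j = 0; j < children.getLength(); j++) {
                if (children.item(j).getNodeType() == Node.ELEMENT_NODE) {
                    elements++;
                }
            }
            out.add(a.getAttribute("href")
                + " children=" + children.getLength()
                + " elements=" + elements);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        for (String line : describeAnchors(SNIPPET)) {
            System.out.println(line);
        }
    }
}
```

Each anchor here wraps a block-level <div> followed by a whitespace text node
rather than plain link text, so comparing these counts with what parse-html
and parse-tika each hand to DOMContentUtils may help narrow down where the
two DOMs differ.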

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
