[ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233970#comment-13233970 ]

Markus Jelsma commented on NUTCH-1317:
--------------------------------------

I am not sure about the root of the problem. We only use Tika for parsing PDF 
and (X)HTML and rely on Boilerpipe. Some HTML pages are huge, full of clutter 
or endless tables; you'd press page down over a hundred times to scroll to the 
bottom. I've not tested all bad URLs, but I think Tika does the job eventually; 
if not, I'll file a ticket. Most I tested work, given enough time. HTML pages 
that take more than one second to parse are considered bad; parsing should take 
less than 50ms on average. The bad ones usually contain too many elements and 
are large in size.
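
To make that timing criterion concrete, here is a minimal sketch that times a 
Tika HTML parse and flags documents exceeding the one-second limit. It assumes 
the thresholds quoted above; the ParseTimer class and method names are my own 
illustration, not Nutch or ticket code.

// Minimal sketch, assuming the thresholds from the comment above; the class
// and method names are illustrative and not part of Nutch or this ticket.
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class ParseTimer {

    // Pages slower than this are "bad"; the average should stay well under 50 ms.
    private static final long BAD_THRESHOLD_MS = 1000;

    public static boolean isSlowParse(byte[] html) throws Exception {
        HtmlParser parser = new HtmlParser();
        // -1 removes BodyContentHandler's default write limit
        BodyContentHandler handler = new BodyContentHandler(-1);

        long start = System.currentTimeMillis();
        try (InputStream in = new ByteArrayInputStream(html)) {
            parser.parse(in, handler, new Metadata(), new ParseContext());
        }
        long elapsed = System.currentTimeMillis() - start;

        return elapsed > BAD_THRESHOLD_MS;
    }
}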
                
> Max content length by MIME-type
> -------------------------------
>
>                 Key: NUTCH-1317
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1317
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>
> The good old http.content.length directive is not sufficient in large 
> internet crawls. For example, a 5MB PDF file may be parsed without issues but 
> a 5MB HTML file may time out.
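
For illustration, a hypothetical sketch of the improvement proposed in the 
description: look up a byte limit for the detected MIME type and truncate the 
fetched content before parsing, instead of applying one global limit. The 
MimeContentLimiter class and the example limits are assumptions, not the 
actual patch.

// Hypothetical illustration only: the class name and limit values below are
// not the actual NUTCH-1317 implementation.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class MimeContentLimiter {

    private final Map<String, Integer> limits = new HashMap<String, Integer>();
    private final int defaultLimit;

    public MimeContentLimiter(int defaultLimit) {
        this.defaultLimit = defaultLimit;
        // Example values: allow large PDFs, keep HTML small enough to parse quickly.
        limits.put("application/pdf", 5 * 1024 * 1024);
        limits.put("text/html", 512 * 1024);
    }

    // Returns the content cut off at the limit configured for its MIME type.
    public byte[] truncate(String mimeType, byte[] content) {
        Integer limit = limits.get(mimeType);
        int max = (limit != null) ? limit.intValue() : defaultLimit;
        return content.length <= max ? content : Arrays.copyOf(content, max);
    }
}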

