[
https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233970#comment-13233970
]
Markus Jelsma commented on NUTCH-1317:
--------------------------------------
I am not sure about the root of the problem. We only use Tika for parsing PDF
and (X)HTML, and rely on Boilerpipe. Some HTML pages are quite a thing, full of
junk or endless tables; you'd press page down over a hundred times to scroll
to the bottom. I haven't tested all bad URLs, but I think Tika does the job
eventually; if not, I'll file a ticket. Most I tested work, given enough time.
HTML pages that take more than one second to parse are considered bad; parsing
should take less than 50ms on average. The bad ones usually contain too many
elements and are large in size.
> Max content length by MIME-type
> -------------------------------
>
> Key: NUTCH-1317
> URL: https://issues.apache.org/jira/browse/NUTCH-1317
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
>
> The good old http.content.limit directive is not sufficient in large
> internet crawls. For example, a 5MB PDF file may be parsed without issues,
> but a 5MB HTML file may time out.
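A per-MIME-type limit as proposed here could be expressed as configuration
properties in conf/nutch-site.xml. The property names below are purely
illustrative; the actual patch may define different keys:

```xml
<!-- Hypothetical sketch of per-MIME-type content limits for Nutch.
     Property names are illustrative, not the keys the patch introduces. -->
<property>
  <name>http.content.limit.application/pdf</name>
  <value>5242880</value> <!-- allow PDF content up to 5 MB -->
</property>
<property>
  <name>http.content.limit.text/html</name>
  <value>1048576</value> <!-- truncate HTML at 1 MB to avoid slow parses -->
</property>
```

The idea is that content types with cheap, size-insensitive parsers (PDF) can
be given a generous cap, while types whose parse time grows badly with size
(HTML with huge element counts) get a tighter one.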
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira