[ https://issues.apache.org/jira/browse/NUTCH-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1387:
----------------------------------------
Fix Version/s: 2.2, 1.7
> All parsers should respond to cancellation / interrupts.
> --------------------------------------------------------
>
> Key: NUTCH-1387
> URL: https://issues.apache.org/jira/browse/NUTCH-1387
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Reporter: Ferdy Galema
> Fix For: 1.7, 2.2
>
>
> During parsing, a TimeoutException can occur. This happens whenever
> FutureTask.get() cannot complete within the specified timeout. The tricky
> part is that a single URL might parse well within the timeout on its own,
> but under heavy concurrent load (a lot of semi-expensive parses) the parser
> work can stack up and cause many parses to time out. This can be the case
> when parsing during fetch. But it can also happen with a separate
> parserjob, because Parser implementations do not necessarily respond to a
> thread interrupt (which is what the task.cancel(true) call fires). If a
> parser does not check the Thread.interrupted state at regular intervals, it
> will just continue to run and eat up resources. I find it very helpful to
> debug stalling fetchers/parsers with the lazy man's profiler:
> kill -QUIT <process_id>. This will dump stack traces, sometimes exposing the
> fact that hundreds of parser threads are still active in the background.
> (Of course, many of them already timed out a long time ago.)
> To fix this, every parser should check its interrupted state at regular
> intervals. (For example, an HTML parse might be stuck walking the DOM tree, so
> checking after every Nth element would be an appropriate moment.)
> This issue is for reference first. Fixing it all at once would be a huge task.
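
As a rough illustration of the suggestion in the description, here is a minimal
sketch of a cooperative interrupt check combined with the get()-with-timeout and
cancel(true) pattern. It is not Nutch code: the class InterruptAwareParse, the
methods walk() and timesOutAndStops(), and the CHECK_EVERY interval are all
hypothetical names chosen for the example, and the counting loop merely stands
in for an expensive parse such as a DOM walk.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class InterruptAwareParse {

    // Hypothetical check interval ("every Nth element"); not a Nutch setting.
    static final int CHECK_EVERY = 1000;

    /** Stand-in for an expensive parse loop, e.g. walking a DOM tree. */
    static long walk(long elements) throws InterruptedException {
        long visited = 0;
        for (long i = 0; i < elements; i++) {
            // Poll the interrupt flag every CHECK_EVERY elements and bail out
            // if the caller has cancelled this task.
            if (i % CHECK_EVERY == 0 && Thread.interrupted()) {
                throw new InterruptedException(
                        "parse cancelled after " + visited + " elements");
            }
            visited++;
        }
        return visited;
    }

    /**
     * Runs walk() on a worker thread with a timeout. Returns false if the walk
     * finished in time. On timeout, cancels the task and returns true only if
     * the worker thread actually stopped within a few seconds.
     */
    static boolean timesOutAndStops(long elements, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Long> task = pool.submit(() -> walk(elements));
        try {
            task.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false; // completed within the timeout
        } catch (TimeoutException e) {
            task.cancel(true); // sets the interrupt flag on the worker thread
        } finally {
            pool.shutdown();
        }
        // Because walk() polls the flag, the worker exits shortly after cancel.
        return pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("worker stopped after cancel: "
                + timesOutAndStops(Long.MAX_VALUE, 50));
    }
}
```

Without the Thread.interrupted() check inside the loop, cancel(true) would
still mark the Future as cancelled, but the worker thread would keep running
in the background, which is the situation the thread dumps reveal.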
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira