[ 
https://issues.apache.org/jira/browse/NUTCH-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1387:
----------------------------------------

    Fix Version/s: 2.2
                   1.7
    
> All parsers should respond to cancellation / interrupts.
> --------------------------------------------------------
>
>                 Key: NUTCH-1387
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1387
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Ferdy Galema
>             Fix For: 1.7, 2.2
>
>
> During parsing a TimeoutException can occur. This happens whenever 
> FutureTask.get() cannot complete within the specified timeout. The tricky 
> part is that single URLs might be perfectly able to complete within the 
> timeout, but when there is a heavy concurrent load (a lot of semi-expensive 
> parses) the parser load might stack up and cause many parses to time out. This 
> can be the case with parsing during fetch. But it can also happen when using a 
> separate parserjob, because Parser implementations do not necessarily respond 
> to a thread interrupt (which is fired by the task.cancel(true) call). If a 
> parser does not check the 
> Thread.interrupted state at regular intervals, it will just continue to run 
> and eat up resources. I find it very helpful to debug stalling 
> fetchers/parsers with the lazy man's profiler: kill -QUIT <process_id>. This 
> will dump stacktraces, sometimes exposing the fact that hundreds of parser 
> threads are still active in the background. (Of course many of them already 
> timed out a long time ago).
> To fix this, every parser should check its interrupted state at regular 
> intervals. (For example, an HTML parse might be stuck walking the DOM tree, so 
> checking after every Nth element would be an appropriate moment.)
> This issue is for reference first. Fixing it all at once would be a huge task.
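
The cooperative check described in the issue could look roughly like the sketch below. It is not taken from Nutch: the class InterruptibleParseSketch, the methods walkNodes and workerStopsAfterTimeout, and the CHECK_EVERY interval are all hypothetical names chosen for illustration, and the "DOM walk" is just a counting loop standing in for real per-node work. Only the java.util.concurrent calls (Future.get with a timeout, Future.cancel(true), Thread.interrupted()) are real JDK APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class InterruptibleParseSketch {

    // Hypothetical interval: look at the interrupt flag once per 1000 nodes.
    static final int CHECK_EVERY = 1000;

    /**
     * Stand-in for an expensive parse (assumption: a real parser would visit
     * DOM nodes here). Returns the number of nodes visited.
     */
    static long walkNodes(long nodeCount) throws InterruptedException {
        long visited = 0;
        for (long i = 0; i < nodeCount; i++) {
            visited++; // placeholder for real per-node work
            // Thread.interrupted() reads and clears the flag set by cancel(true).
            if (i % CHECK_EVERY == 0 && Thread.interrupted()) {
                throw new InterruptedException(
                        "parse cancelled after " + visited + " nodes");
            }
        }
        return visited;
    }

    /**
     * Runs walkNodes with a timeout, cancels it on TimeoutException, and
     * reports whether the worker thread actually left the parse within a
     * five second grace period.
     */
    static boolean workerStopsAfterTimeout(long nodeCount, long timeoutMillis)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch finished = new CountDownLatch(1);
        Future<Long> task = executor.submit(() -> {
            try {
                return walkNodes(nodeCount);
            } finally {
                finished.countDown(); // reached on normal return or interrupt
            }
        });
        try {
            task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; the parse loop has to
            // notice the interrupt itself, which is the point of the issue.
            task.cancel(true);
        } finally {
            executor.shutdown();
        }
        return finished.await(5, TimeUnit.SECONDS);
    }
}
```

Calling workerStopsAfterTimeout(Long.MAX_VALUE, 100) should report that the worker stopped shortly after the timeout, because the loop notices the interrupt at its next check. If the Thread.interrupted() check were removed from walkNodes, the same call would leave the thread spinning in the background, which is the behaviour the thread dumps in the description expose.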

