[ https://issues.apache.org/jira/browse/NUTCH-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1387:
----------------------------------------

    Fix Version/s: 2.2
                   1.7

> All parsers should respond to cancellation / interrupts.
> --------------------------------------------------------
>
>                 Key: NUTCH-1387
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1387
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Ferdy Galema
>             Fix For: 1.7, 2.2
>
>
> During parsing a TimeoutException can occur. This happens whenever
> FutureTask.get() cannot complete within the specified timeout. The tricky
> part is that a single URL might parse well within the timeout on its own,
> but under heavy concurrent load (a lot of semi-expensive parses) the parser
> load can stack up and cause many parses to time out. This can be the case
> with parsing during fetch. But it can also happen when using a separate
> parserjob, because Parser implementations do not necessarily respond to a
> thread interrupt (which is fired by the task.cancel(true) call). If a parser
> does not check the Thread.interrupted state at regular intervals, it will
> just continue to run and eat up resources. I find it very helpful to debug
> stalling fetchers/parsers with the lazy man's profiler:
> kill -QUIT <process_id>. This dumps stack traces, sometimes exposing the
> fact that hundreds of parser threads are still active in the background,
> even though many of them timed out a long time ago.
> To fix this, every parser should check its interrupted state at regular
> intervals. (For example, an HTML parse might be stuck walking the DOM tree,
> so checking after every Nth element would be an appropriate moment.)
> This issue is for reference first. Fixing it all at once would be a huge task.
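
For illustration, the "check after every Nth element" idea from the description could look roughly like the sketch below. This is not Nutch code. The class name InterruptibleParseSketch, the methods walk and parseWithTimeout, and the interval CHECK_EVERY are all hypothetical, and the counting loop only stands in for a real DOM walk. It uses a plain ExecutorService and Future rather than Nutch's own task handling, but the get-with-timeout followed by cancel(true) pattern is the one the issue describes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class InterruptibleParseSketch {

    // Hypothetical interval: look at the interrupt flag once per 1000 elements.
    static final int CHECK_EVERY = 1000;

    // Stand-in for an expensive DOM walk. Returns the number of elements visited,
    // or throws InterruptedException once a pending interrupt is noticed.
    static long walk(long totalElements) throws InterruptedException {
        long visited = 0;
        for (long i = 0; i < totalElements; i++) {
            visited++; // placeholder for the real per-element work
            if (visited % CHECK_EVERY == 0 && Thread.interrupted()) {
                throw new InterruptedException("parse cancelled after " + visited + " elements");
            }
        }
        return visited;
    }

    // Runs the walk with a timeout. Returns "completed" if it finished in time,
    // "cancelled" if it timed out and the worker thread stopped after cancel(true),
    // and "still running" if the worker ignored the interrupt.
    static String parseWithTimeout(long totalElements, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<Long> job = () -> walk(totalElements);
        Future<Long> task = executor.submit(job);
        try {
            task.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (TimeoutException e) {
            task.cancel(true); // delivers the thread interrupt mentioned in the issue
        } finally {
            executor.shutdown();
        }
        // After cancellation, confirm that the worker thread really went away.
        return executor.awaitTermination(5, TimeUnit.SECONDS) ? "cancelled" : "still running";
    }
}
```

Without the Thread.interrupted() check in walk, the second outcome would be "still running": the caller gets its TimeoutException, but the worker thread keeps consuming CPU, which is the pile-up of background parser threads that kill -QUIT exposes.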