[ 
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285510#comment-13285510
 ] 

Ferdy Galema commented on NUTCH-1356:
-------------------------------------

I find it difficult to believe those exceptions are caused by this patch. It 
does not change the way exceptions/timeouts are handled; it only makes sure 
parser threads are reused.

It seems you are suffering from two types of (unrelated) exceptions. The first 
is ExecutionException. FutureTask.get() throws it whenever the task itself 
threw an exception that was not caught anywhere inside the task. In your case 
this seems to be an NPE during the parse of the html page. It might be a bug, 
but then again it is strange that it is not reproducible with the 
ParserChecker. (Are you sure about this?)
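
As a minimal sketch of that first case (the class and method names below are 
made up for illustration and are not taken from Nutch or from the patch), an 
exception thrown inside the submitted task only surfaces at get(), wrapped in 
an ExecutionException whose cause is the original exception:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; not part of Nutch.
final class ExecutionExceptionDemo {

    // Stand-in for a parser with a bug: throws a NullPointerException
    // when the content is null.
    static String brokenParse(String content) {
        return content.trim();
    }

    // Runs the parse on a worker thread. Returns the simple class name of
    // the underlying failure, or null if the parse succeeded.
    static String causeOfFailure(String content) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> task = executor.submit(() -> brokenParse(content));
            task.get(); // the exception from the worker surfaces here
            return null;
        } catch (ExecutionException e) {
            // The original exception is available as the cause.
            return e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(causeOfFailure(null));
    }
}
```

Logging e.getCause() rather than the ExecutionException itself usually gives 
the stacktrace that points at the actual parse failure.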

The second is TimeoutException, thrown whenever FutureTask.get() cannot 
complete within the specified timeout. The tricky part is that single urls 
might be perfectly able to complete within the timeout, but under heavy 
concurrent load (a lot of semi-expensive parses) the parser load might stack 
up and cause many parses to time out. This can be the case with parsing 
during fetch. But it can also happen when using a separate parserjob, because 
Parser implementations do not necessarily respond to a thread interrupt 
(which is fired by the task.cancel(true) call). If a parser does not check 
the thread's interrupted state at regular intervals, it will just continue 
to run and eat up resources. I find it very helpful to debug stalling 
fetchers/parsers with the lazy man's profiler: kill -QUIT <process_id>. This 
will dump stacktraces, sometimes exposing the fact that hundreds of parser 
threads are still active in the background. (Of course, many of them already 
timed out a long time ago.)
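
To illustrate the interrupt point, here is a rough sketch (all names are 
hypothetical, and the busy loop merely stands in for an expensive parse). It 
times out a task, calls cancel(true), and then checks whether the worker 
thread is in fact still running. Only the variant that polls the interrupted 
flag stops:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; not part of Nutch.
final class InterruptDemo {

    // Returns true if the task body is still running one second after
    // it timed out and was cancelled with cancel(true).
    static boolean stillRunningAfterCancel(boolean checksInterrupt) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean release = new AtomicBoolean(false);
        try {
            Future<?> task = executor.submit(() -> {
                started.countDown();
                try {
                    // Simulated long parse: spins until released.
                    while (!release.get()) {
                        if (checksInterrupt
                                && Thread.currentThread().isInterrupted()) {
                            return; // cooperative parser stops here
                        }
                    }
                } finally {
                    finished.countDown();
                }
            });
            started.await(); // make sure the body really runs
            try {
                task.get(100, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                task.cancel(true); // only sets the interrupt flag
            }
            return !finished.await(1, TimeUnit.SECONDS);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            release.set(true); // let the stubborn task end after the check
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("cooperative still running: "
                + stillRunningAfterCancel(true));
        System.out.println("uncooperative still running: "
                + stillRunningAfterCancel(false));
    }
}
```

In the uncooperative case the Future already reports itself as cancelled, 
yet the thread keeps consuming CPU until the loop ends on its own.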
                
> ParseUtil use ExecutorService instead of manually thread handling.
> ------------------------------------------------------------------
>
>                 Key: NUTCH-1356
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1356
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, 
> NUTCH-1356.patch
>
>
> Because ParseUtil manages its own parser threads by creating a thread for 
> every parse, specific parsers sometimes become very expensive. 
> For example, parsers that have threadlocal fields will initialize them for 
> every item to be parsed.
> By simply introducing a caching ExecutorService, ParseUtil will be able to 
> reuse threads, making parsing more efficient. See attached patch.
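
The effect described in the issue can be sketched roughly as follows. This is 
not the attached patch; the class, the ThreadLocal and the counter are 
invented for illustration. With one thread per parse the thread-local is 
initialized for every item, while a cached pool from 
Executors.newCachedThreadPool() can hand later parses to an idle thread that 
already holds an initialized value:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of Nutch or of the patch.
final class ThreadReuseDemo {

    // Counts how often the thread-local below had to be initialized.
    static final AtomicInteger INITS = new AtomicInteger();

    // Stand-in for an expensive per-thread parser resource.
    static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> {
                INITS.incrementAndGet();
                return new StringBuilder();
            });

    // Stand-in for a parse that touches the thread-local field.
    static void parse() {
        BUFFER.get().setLength(0);
    }

    // One new thread per parse: every parse pays for initialization.
    static int initsWithThreadPerParse(int parses) {
        INITS.set(0);
        try {
            for (int i = 0; i < parses; i++) {
                Thread t = new Thread(ThreadReuseDemo::parse);
                t.start();
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return INITS.get();
    }

    // Cached pool: idle worker threads are reused when available.
    static int initsWithCachedPool(int parses) {
        INITS.set(0);
        ExecutorService executor = Executors.newCachedThreadPool();
        Runnable job = ThreadReuseDemo::parse;
        try {
            for (int i = 0; i < parses; i++) {
                executor.submit(job).get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return INITS.get();
    }

    public static void main(String[] args) {
        System.out.println("thread per parse: " + initsWithThreadPerParse(5));
        System.out.println("cached pool:      " + initsWithCachedPool(5));
    }
}
```

The first count always equals the number of parses. The second is at most 
that and typically much lower, although the exact value depends on whether a 
worker has gone back to idle before the next submit.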
