[jira] [Commented] (TIKA-456) Support timeouts for parsers

Ken Krugler (JIRA) Wed, 18 Mar 2015 07:19:48 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367172#comment-14367172
 ]


Ken Krugler commented on TIKA-456:
----------------------------------

Re killing a thread - yes, that's not possible to do safely. When parsing a web 
crawl, we try to interrupt a hung parse thread, and then we abandon it. So 
we'll wind up with some zombie threads over time, but that's still a win 
compared to either (a) having the job hang and fail, or (b) having a much slower 
parse because we're forking the JVM for each of the many hundreds of millions 
of documents. And in Hadoop we can limit the number of times a child JVM is 
reused, so eventually the hung threads get cleaned up.
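
The interrupt-then-abandon approach described above can be sketched roughly 
as follows. This is only an illustration, not code from Tika or from our 
crawler: the class name AbandonOnTimeout and the method runWithTimeout are 
hypothetical, and the daemon-thread setting is an assumption about how one 
might keep abandoned threads from blocking JVM exit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only.
class AbandonOnTimeout {

    /**
     * Runs the work on its own thread. If it does not finish in time, the
     * worker is interrupted and then abandoned, and the TimeoutException
     * is rethrown to the caller.
     */
    static <T> T runWithTimeout(Callable<T> work, long timeout, TimeUnit unit)
            throws Exception {
        FutureTask<T> task = new FutureTask<>(work);
        Thread worker = new Thread(task, "parse-worker");
        // Assumption: daemon threads, so an abandoned worker cannot keep
        // the JVM alive on its own.
        worker.setDaemon(true);
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // Requests an interrupt. Code that never checks the interrupt
            // flag keeps running; that thread is simply left behind.
            task.cancel(true);
            throw e;
        }
    }
}
```

A thread that ignores the interrupt stays alive until the JVM exits, which 
is why limiting child JVM reuse matters in this setup.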

So supporting the Thread.sleep(x) setting in ForkServer wouldn't meet the goal 
I had when I originally filed this issue. I would suggest opening up another 
issue to track that change (which seems like a good idea, BTW).

It seems like my use case isn't common enough to include a TikaCallable in the 
main code. If so, then feel free to close out this issue.

> Support timeouts for parsers
> ----------------------------
>
>                 Key: TIKA-456
>                 URL: https://issues.apache.org/jira/browse/TIKA-456
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common 
> case is when a parser is fed an incomplete document, such as what happens 
> when limiting the amount of data fetched during a web crawl.
> One solution is to create a TikaCallable that wraps the Tika parser, and 
> then use this with a FutureTask. For example, when using a ParsedDatum POJO 
> for the results of the parse operation, I do something like this:
>     parser = new AutoDetectParser();
>     Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler,
>         inputstream, metadata);
>     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
>     Thread t = new Thread(task);
>     t.start();
>     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And TikaCallable looks like:
> class TikaCallable implements Callable<ParsedDatum> {
>     public TikaCallable(Parser parser, ContentHandler handler,
>             InputStream is, Metadata metadata) {
>         _parser = parser;
>         _handler = handler;
>         _input = is;
>         _metadata = metadata;
>         ...
>     }
>     public ParsedDatum call() throws Exception {
>         ....
>         _parser.parse(_input, _handler, _metadata, new ParseContext());
>         ....
>     }
> }
> This seems like it would be generally useful, as I doubt that we'd ever be 
> able to guarantee that none of the parsers being wrapped by Tika could ever 
> hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. 
> something like:
>   Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code 
> above) and use the specified timeout when calling task.get().
> One downside of this approach is that it creates a new thread for each parse 
> request, but I don't think the thread overhead is significant when compared 
> to the cost of a typical parse operation.
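
A minimal sketch of the TimeoutParser idea from the description above, 
written so that it compiles without Tika on the classpath. The DocParser 
interface is a hypothetical stand-in for Tika's Parser, and TimeoutDocParser 
is a hypothetical name; neither exists in Tika. A real wrapper would need to 
implement the actual Parser interface and pass through the ContentHandler, 
Metadata and ParseContext arguments.

```java
import java.io.InputStream;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for Tika's Parser interface, so the sketch is
// self-contained. It is NOT the real Tika API.
interface DocParser {
    String parse(InputStream in) throws Exception;
}

// Decorator that bounds every parse call with a timeout (hypothetical name).
class TimeoutDocParser implements DocParser {
    private final DocParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutDocParser(DocParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public String parse(InputStream in) throws Exception {
        // Wrap the delegate call in a Callable, as in the TikaCallable idea.
        FutureTask<String> task = new FutureTask<>(() -> delegate.parse(in));
        // One new thread per parse request, the cost noted in the description.
        Thread worker = new Thread(task, "timeout-parser");
        worker.setDaemon(true); // assumption: do not block JVM exit
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // best-effort interrupt; thread may linger
            throw e;
        }
    }
}
```

Callers would use it like any other parser, for example 
new TimeoutDocParser(inner, 20, TimeUnit.SECONDS).parse(stream), and handle 
TimeoutException for documents that hang.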



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
