Support timeouts for parsers
----------------------------

                 Key: TIKA-456
                 URL: https://issues.apache.org/jira/browse/TIKA-456
             Project: Tika
          Issue Type: Improvement
            Reporter: Ken Krugler
            Assignee: Chris A. Mattmann


There are a number of reasons why Tika could hang while parsing. One common 
case is a parser being fed an incomplete document, as happens when a web 
crawl limits the amount of data fetched.

One solution is to create a TikaCallable that wraps the Tika parser, and then 
use this with a FutureTask. For example, when using a ParsedDatum POJO for the 
results of the parse operation, I do something like this:

    parser = new AutoDetectParser();
    Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler,
            inputstream, metadata);
    FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
    Thread t = new Thread(task);
    t.start();

    // Throws TimeoutException if the parse has not finished in time.
    ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);

And TikaCallable looks like this:

    class TikaCallable implements Callable<ParsedDatum> {
        public TikaCallable(Parser parser, ContentHandler handler,
                InputStream is, Metadata metadata) {
            _parser = parser;
            _handler = handler;
            _input = is;
            _metadata = metadata;
            ...
        }

        public ParsedDatum call() throws Exception {
            ....
            _parser.parse(_input, _handler, _metadata, new ParseContext());
            ....
        }
    }
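
The task.get() call above throws a TimeoutException when the parse runs too 
long, and the worker thread keeps running unless it is cancelled. Below is a 
minimal sketch of handling that case using only java.util.concurrent. The 
class name TimedCallDemo, the helper callWithTimeout, and the choice to 
return null on timeout are assumptions made for illustration, and the lambdas 
are stand-ins for a real parse.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper class, not part of Tika.
class TimedCallDemo {

    /**
     * Runs the callable on its own thread and waits at most timeoutMillis.
     * Returns null on timeout (a choice made for this sketch only).
     */
    static <T> T callWithTimeout(Callable<T> work, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        FutureTask<T> task = new FutureTask<T>(work);
        Thread t = new Thread(task, "parse-worker");
        t.setDaemon(true); // a stuck worker should not keep the JVM alive
        t.start();
        try {
            return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker thread. Code that never
            // checks the interrupt flag will keep running regardless.
            task.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that finishes quickly.
        String fast = callWithTimeout(() -> "parsed", 1000);
        // Stand-in for a hung parse: sleeps far longer than the timeout.
        String slow = callWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 100);
        System.out.println(fast + " / " + slow);
    }
}
```

Note that cancelling only requests an interrupt, so a parser stuck in a tight 
loop would still consume a thread until the JVM exits.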

This seems like it would be generally useful, as I doubt that we'd ever be 
able to guarantee that none of the parsers being wrapped by Tika could ever 
hang.

One idea is to create a TimeoutParser that wraps a regular Tika Parser, 
something like:

    Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);

Then the call to p.parse(...) would create a Callable (similar to the code 
above) and use the specified timeout when calling task.get().
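
A rough sketch of how such a wrapper might look is below. To keep it 
compilable without Tika on the classpath, it uses a hypothetical one-method 
SimpleParser interface in place of Tika's Parser, and it rethrows the 
TimeoutException on timeout; a real implementation would implement the full 
Parser interface and would probably pick a Tika-specific exception. Both of 
those points are assumptions, not part of the proposal.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for Tika's Parser, reduced to a single method. */
interface SimpleParser {
    String parse(InputStream in) throws Exception;
}

/** Sketch of the proposed wrapper: delegates to another parser under a time limit. */
class TimeoutParser implements SimpleParser {
    private final SimpleParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutParser(SimpleParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public String parse(final InputStream in) throws Exception {
        // Run the wrapped parser on a separate thread, one per request.
        FutureTask<String> task = new FutureTask<String>(() -> delegate.parse(in));
        Thread worker = new Thread(task, "timeout-parser");
        worker.setDaemon(true);
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // request an interrupt of the worker
            throw e;
        } catch (ExecutionException e) {
            // Surface the wrapped parser's own exception where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}

class TimeoutParserDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for a parser that hangs.
        SimpleParser hanging = in -> {
            Thread.sleep(60_000);
            return "never";
        };
        SimpleParser p = new TimeoutParser(hanging, 200, TimeUnit.MILLISECONDS);
        try {
            p.parse(new ByteArrayInputStream(new byte[0]));
        } catch (TimeoutException e) {
            System.out.println("parse timed out");
        }
    }
}
```

Because the wrapper has the same interface as the parser it wraps, callers 
would not need to change beyond constructing it.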

One drawback of this approach is that it creates a new thread for each parse 
request, but I don't think the thread overhead is significant compared to the 
cost of a typical parse operation.
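
If the per-request thread did turn out to matter, one alternative would be to 
submit the work to a shared ExecutorService so that idle threads can be 
reused. The sketch below is only an illustration of that variation; the class 
name PooledTimeoutRunner and its methods are hypothetical, and the choice of 
a cached pool with daemon threads is an assumption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper class, not part of Tika.
class PooledTimeoutRunner {
    // A cached pool can reuse idle threads instead of always creating one.
    // Daemon threads keep a stuck task from blocking JVM exit.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-pool");
        t.setDaemon(true);
        return t;
    });

    /** Runs the work on the pool and waits at most the given timeout. */
    <T> T run(Callable<T> work, long timeout, TimeUnit unit) throws Exception {
        Future<T> future = pool.submit(work);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // request an interrupt of the pooled thread
            throw e;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}

class PooledTimeoutRunnerDemo {
    public static void main(String[] args) throws Exception {
        PooledTimeoutRunner runner = new PooledTimeoutRunner();
        // Stand-in for a parse that completes within the limit.
        Integer value = runner.run(() -> 42, 1, TimeUnit.SECONDS);
        System.out.println("result: " + value);
        runner.shutdown();
    }
}
```

A task that ignores interruption still occupies its pooled thread after the 
timeout, so a pool does not remove the need to watch for leaked threads.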


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
