Support timeouts for parsers
----------------------------
Key: TIKA-456
URL: https://issues.apache.org/jira/browse/TIKA-456
Project: Tika
Issue Type: Improvement
Reporter: Ken Krugler
Assignee: Chris A. Mattmann
Tika can hang while parsing for a number of reasons. One common case is when
a parser is fed an incomplete document, for example when a web crawl limits
the amount of data fetched.
One solution is to create a TikaCallable that wraps the Tika parser, and then
use this with a FutureTask. For example, when using a ParsedDatum POJO for the
results of the parse operation, I do something like this:

    Parser parser = new AutoDetectParser();
    Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler,
        inputstream, metadata);
    FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
    Thread t = new Thread(task);
    t.start();
    // Throws TimeoutException if the parse takes longer than the limit.
    ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);

And TikaCallable looks like:

    class TikaCallable implements Callable<ParsedDatum> {
        public TikaCallable(Parser parser, ContentHandler handler, InputStream is,
                Metadata metadata) {
            _parser = parser;
            _handler = handler;
            _input = is;
            _metadata = metadata;
            ...
        }

        public ParsedDatum call() throws Exception {
            ....
            _parser.parse(_input, _handler, _metadata, new ParseContext());
            ....
        }
    }

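The snippets above do not show what happens when the timeout fires. As a rough sketch only: the following uses nothing but java.util.concurrent, with a plain Callable standing in for the Tika parse. The TimedCall class and its run method are hypothetical names, not part of Tika. It cancels the task on timeout, which interrupts the worker thread; that only helps if the parser actually responds to interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a Tika class): runs a Callable on its own thread
// and gives up waiting after the given timeout.
class TimedCall {
    static <T> T run(Callable<T> work, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        FutureTask<T> task = new FutureTask<>(work);
        Thread worker = new Thread(task, "timed-call");
        worker.setDaemon(true); // a hung worker should not keep the JVM alive
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupt the worker; code that ignores interrupts keeps running.
            task.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that finishes in time.
        String result = run(() -> "parsed", 1, TimeUnit.SECONDS);
        System.out.println(result);

        // Stand-in for a parse that hangs.
        try {
            run(() -> { Thread.sleep(60_000); return "never"; },
                    100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```
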
This seems like it would be generally useful, as I doubt that we'd ever be
able to guarantee that none of the parsers being wrapped by Tika could ever
hang.
One idea is to create a TimeoutParser that wraps a regular Tika Parser, e.g.
something like:

    Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);

Then the call to p.parse(...) would create a Callable (similar to the code
above) and use the specified timeout when calling task.get().
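To make the wrapper idea concrete, here is a minimal sketch of the decorator shape. It is not a proposed implementation: SimpleParser is a hypothetical one-method stand-in for Tika's Parser interface so that the example compiles without Tika on the classpath, and a real TimeoutParser would implement Parser and forward parse(stream, handler, metadata, context) instead. Unwrapping the ExecutionException is my assumption about what callers would want, so that they see the parser's own exception.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for Tika's Parser interface, reduced to one method.
interface SimpleParser {
    String parse(String input) throws Exception;
}

// Sketch of the proposed wrapper: same interface as the delegate, plus a deadline.
class TimeoutParser implements SimpleParser {
    private final SimpleParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutParser(SimpleParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public String parse(String input) throws Exception {
        // One new thread per parse call, as described in the text.
        FutureTask<String> task = new FutureTask<>(() -> delegate.parse(input));
        Thread worker = new Thread(task, "timeout-parser");
        worker.setDaemon(true);
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // interrupts the worker, if it is interruptible
            throw e;
        } catch (ExecutionException e) {
            // Surface the delegate's own exception rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}
```

Callers would use the wrapper exactly like the parser it wraps, and would see a TimeoutException when the deadline passes.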
One drawback of this approach is that it creates a new thread for each parse
request, but I don't think the thread overhead is significant compared to the
cost of a typical parse operation.
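If the per-request thread did turn out to matter, one possible variation (my assumption, not something proposed in this issue) is to submit the work to a shared ExecutorService so that idle threads are reused. PooledTimedCall below is a hypothetical name. Note that a parser which ignores interruption would still occupy its pool thread after the timeout, so a cached pool can grow if many parses hang.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a Tika class): like a thread-per-call timeout,
// but worker threads come from a shared pool.
class PooledTimedCall {
    // Cached pool: reuses idle threads and creates new ones only on demand.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true); // hung workers should not block JVM shutdown
        return t;
    });

    static <T> T run(Callable<T> work, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(work);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the pooled worker
            throw e;
        }
    }
}
```
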