[ https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-456:
-----------------------------------
Component/s: parser
- classify the component
> Support timeouts for parsers
> ----------------------------
>
> Key: TIKA-456
> URL: https://issues.apache.org/jira/browse/TIKA-456
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Ken Krugler
> Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common
> case is when a parser is fed an incomplete document, as happens when a web
> crawl limits the amount of data it fetches.
> One solution is to create a TikaCallable that wraps the Tika parser, and
> then use this with a FutureTask. For example, when using a ParsedDatum POJO
> for the results of the parse operation, I do something like this:
>     parser = new AutoDetectParser();
>     Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler,
>             inputstream, metadata);
>     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
>     Thread t = new Thread(task);
>     t.start();
>     // get() throws TimeoutException if the parse exceeds the limit.
>     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And TikaCallable() looks like:
>     class TikaCallable implements Callable<ParsedDatum> {
>         public TikaCallable(Parser parser, ContentHandler handler,
>                 InputStream is, Metadata metadata) {
>             _parser = parser;
>             _handler = handler;
>             _input = is;
>             _metadata = metadata;
>             ...
>         }
>         public ParsedDatum call() throws Exception {
>             ....
>             _parser.parse(_input, _handler, _metadata, new ParseContext());
>             ....
>         }
>     }
> This seems like it would be generally useful, as I doubt that we'd ever be
> able to guarantee that none of the parsers being wrapped by Tika could ever
> hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser,
> something like:
>     Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code
> above) and use the specified timeout when calling task.get().
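
The wrap-in-a-FutureTask idea described above can be sketched without any Tika dependency. This is only an illustration under stated assumptions: TimeoutRunner and callWithTimeout are hypothetical names that do not exist in Tika, and the Callable stands in for whatever would invoke parser.parse(...). On timeout the sketch cancels the task, which interrupts the worker thread; a parser that ignores interrupts would keep running in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Tika): runs any Callable with a deadline.
public class TimeoutRunner {

    public static <T> T callWithTimeout(Callable<T> work, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        FutureTask<T> task = new FutureTask<>(work);
        Thread worker = new Thread(task, "timeout-runner");
        // Daemon thread, so a worker that never returns cannot block JVM exit.
        worker.setDaemon(true);
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that ignores interrupts keeps running.
            task.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse call that finishes well within the limit.
        String result = callWithTimeout(() -> "parsed", 1, TimeUnit.SECONDS);
        System.out.println(result);

        // Stand-in for a parse call that hangs.
        try {
            callWithTimeout(() -> {
                Thread.sleep(5_000);
                return "never";
            }, 100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

A TimeoutParser along the lines proposed above could delegate to a helper like this from its parse(...) method, translating TimeoutException into whatever exception type the parser contract calls for.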
> One drawback of this approach is that it creates a new thread for each parse
> request, but I don't think the thread overhead is significant compared to
> the cost of a typical parse operation.
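
If the per-request thread did turn out to matter, one possible variation is to submit the work to a shared ExecutorService instead of starting a new Thread each time. The sketch below is an assumption-laden illustration, not something proposed in the issue: PooledTimeoutRunner is a hypothetical name, and the Callable again stands in for the actual parse call. Two trade-offs are worth noting: with a fixed-size pool the timeout also counts time spent waiting in the queue, and a task that ignores interruption keeps occupying a pool thread after it has timed out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Tika): reuses pool threads across calls.
public class PooledTimeoutRunner implements AutoCloseable {

    private final ExecutorService pool;

    public PooledTimeoutRunner(int threads) {
        // Daemon threads, so a stuck task cannot block JVM exit.
        this.pool = Executors.newFixedThreadPool(threads, runnable -> {
            Thread t = new Thread(runnable, "pooled-timeout-runner");
            t.setDaemon(true);
            return t;
        });
    }

    public <T> T call(Callable<T> work, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        Future<T> future = pool.submit(work);
        try {
            // The clock starts at submission, so queue time counts too.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the pool thread if the task has already started.
            future.cancel(true);
            throw e;
        }
    }

    @Override
    public void close() {
        // Interrupts running tasks and discards queued ones.
        pool.shutdownNow();
    }
}
```

A caller would create one PooledTimeoutRunner up front, invoke call(...) once per document, and close it on shutdown.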