[jira] [Commented] (TIKA-456) Support timeouts for parsers

Ken Krugler (JIRA) Tue, 07 Apr 2015 13:51:34 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484026#comment-14484026
 ]


Ken Krugler commented on TIKA-456:
----------------------------------

Hi Tim,

Yes, I'm interested in integrating Tika into Common Crawl, especially as I plan 
to try out my new language detector (Yalder) against a bunch of typical HTML 
files.

The main challenge in the past was in reliably processing the WARC files in 
Hadoop; the various approaches available when I was trying this a few years 
back all had differing issues. I haven't been tracking the current status, but 
it should be better now :)

Is there a page anywhere that discusses various goals for this exercise? I know 
you've been promoting the idea of using Tika to extract text (for CC WET 
results) from files other than HTML, and adding metadata, but what about in 
terms of Tika itself? E.g. tracking hangs/exceptions, and what we'd want to do 
with those results.

Thanks,

-- Ken

> Support timeouts for parsers
> ----------------------------
>
>                 Key: TIKA-456
>                 URL: https://issues.apache.org/jira/browse/TIKA-456
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common 
> case is when a parser is fed an incomplete document, such as what happens 
> when limiting the amount of data fetched during a web crawl.
> One solution is to create a TikaCallable that wraps the Tika parser, and 
> then use this with a FutureTask. For example, when using a ParsedDatum POJO 
> for the results of the parse operation, I do something like this:
>     parser = new AutoDetectParser();
>     Callable<ParsedDatum> c =
>         new TikaCallable(parser, contenthandler, inputstream, metadata);
>     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
>     Thread t = new Thread(task);
>     t.start();
>     // get() throws TimeoutException if the parse exceeds the limit
>     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And TikaCallable() looks like:
> class TikaCallable implements Callable<ParsedDatum> {
>     public TikaCallable(Parser parser, ContentHandler handler,
>                         InputStream is, Metadata metadata) {
>         _parser = parser;
>         _handler = handler;
>         _input = is;
>         _metadata = metadata;
>         ...
>     }
>     public ParsedDatum call() throws Exception {
>         ....
>         _parser.parse(_input, _handler, _metadata, new ParseContext());
>         ....
>     }
> }
> This seems like it would be generally useful, as I doubt that we'd ever be 
> able to guarantee that none of the parsers being wrapped by Tika could ever 
> hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. 
> something like:
>   Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code 
> above) and use the specified timeout when calling task.get().
> One downside of this approach is that it creates a new thread for each parse 
> request, but I don't think the thread overhead is significant when compared 
> to the typical parser operation.
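
For reference, the thread-per-call timeout pattern from the description can be sketched on its own, without any Tika classes. This is only a minimal sketch: TimeoutRunner and callWithTimeout are hypothetical names, not part of Tika, and the helper assumes the wrapped work responds to thread interruption. A parser that ignores interrupts would keep running in its (daemon) thread after the timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a Tika class): runs a Callable in its own
// thread and gives up waiting after the specified timeout.
final class TimeoutRunner {
    private TimeoutRunner() {}

    static <T> T callWithTimeout(Callable<T> work, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        FutureTask<T> task = new FutureTask<>(work);
        Thread worker = new Thread(task, "timeout-runner");
        // Daemon thread, so a hung worker cannot keep the JVM alive.
        worker.setDaemon(true);
        worker.start();
        try {
            // Throws TimeoutException if the work does not finish in time.
            return task.get(timeout, unit);
        } finally {
            // No-op if the task already completed; otherwise interrupts
            // the worker. Work that ignores interrupts keeps running.
            task.cancel(true);
        }
    }
}
```

With Tika, the Callable would wrap the parse call, as TikaCallable.call() does above. A caller would catch TimeoutException to record a hang and ExecutionException to record a parser failure.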



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
