[jira] Commented: (TIKA-416) Out-of-process text extraction

Chris A. Mattmann (JIRA) Fri, 30 Apr 2010 07:46:15 -0700

    [ https://issues.apache.org/jira/browse/TIKA-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862690#action_12862690 ]


Chris A. Mattmann commented on TIKA-416:
----------------------------------------

+1, this sounds like a great idea!

We did some work on this in OODT in the form of simple external metadata
("met") extractors. Maybe we could follow a similar approach here. Check out:

http://svn.apache.org/repos/asf/incubator/oodt/cas-metadata/trunk/src/main/java/gov/nasa/jpl/oodt/cas/metadata/extractors/ExternMetExtractor.java

and 

http://svn.apache.org/repos/asf/incubator/oodt/cas-metadata/trunk/src/main/resources/examples/extern-config.xml

as examples of how to deal with this. (Note: in OODT-3, we are still in the
process of converting over the licenses, and there are no "official" incubator
releases of OODT yet; I just wanted to point you to it as one possible way to
get this done.) You rock and I can't wait for this feature!
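
For illustration only, here is a minimal Java sketch of the out-of-process idea
in this issue: run the extractor as a child process, cap how long it may run,
and read back whatever text it wrote. This is not the OODT ExternMetExtractor
code and not a Tika API. The class name ExternalExtractorSketch, the run method,
and its parameters are all hypothetical, and the command to execute is left to
the caller.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or OODT.
public class ExternalExtractorSketch {

    /**
     * Runs the given command in a child process and returns its combined
     * stdout/stderr as UTF-8 text. The child is killed if it does not
     * finish within timeoutSeconds.
     */
    public static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException, TimeoutException {
        // Send output to a temp file so a chatty child cannot block on a full pipe
        // while we wait for it with a timeout.
        File output = File.createTempFile("extract", ".out");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(output)
                    .start();

            // A hung or runaway child only costs us the child process.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new TimeoutException(
                        "extractor exceeded " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException(
                        "extractor exited with code " + process.exitValue());
            }
            return new String(Files.readAllBytes(output.toPath()),
                    StandardCharsets.UTF_8);
        } finally {
            output.delete();
        }
    }
}
```

If the command starts another JVM, passing -Xmx to that child caps its heap
separately from the parent, so a crash or OutOfMemoryError in the child would
surface here as a non-zero exit code or a timeout instead of taking down the
calling JVM.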

> Out-of-process text extraction
> ------------------------------
>
>                 Key: TIKA-416
>                 URL: https://issues.apache.org/jira/browse/TIKA-416
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> There's currently no easy way to guard against JVM crashes or excessive 
> memory or CPU use caused by parsing very large, broken or intentionally 
> malicious input documents. To better protect against such cases and to 
> generally improve the manageability of resource consumption by Tika, it would 
> be great if we had a way to run Tika parsers in separate JVM processes. This 
> could be handled either as a separate "Tika parser daemon" or as an 
> explicitly managed pool of forked JVMs.
