[ https://issues.apache.org/jira/browse/TIKA-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-416.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.9
Assignee: Jukka Zitting
An initial version of this feature is now working and included in the latest
trunk.
To illustrate the improvement, here's what I see when parsing one fairly
large Excel document with a 32 MB heap:
$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar large.xls
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:69)
at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:55)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:157)
at org.apache.tika.detect.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:145)
at org.apache.tika.detect.POIFSContainerDetector.detect(POIFSContainerDetector.java:96)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:60)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:126)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)
An OutOfMemoryError like this is especially troublesome in container
environments, where hitting the memory limit affects all active threads, not
just the one using Tika.
With the new out-of-process parsing feature, it's possible to externalize this
problem into a separate background process:
$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar --fork comlex-document.xls
Exception in thread "main" java.io.IOException: Lost connection to a forked server process
at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:149)
at org.apache.tika.fork.ForkClient.call(ForkClient.java:84)
at org.apache.tika.fork.ForkParser.parse(ForkParser.java:78)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)
An ordinary exception like this IOException is much easier to recover from
than an OutOfMemoryError.
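For readers who want to see the general idea in isolation, here is a minimal
JDK-only sketch of the same pattern: run the risky work in a child JVM with its
own heap cap and turn any abnormal exit into an IOException in the parent. This
is not how org.apache.tika.fork.ForkParser is implemented; the names ForkSketch,
runInChildJvm, ParseJob and app.jar are hypothetical and used only for
illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ForkSketch {

    /** Path of the java launcher that belongs to the currently running JVM. */
    static String javaCommand() {
        return System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
    }

    /**
     * Runs "java -Xmx<heapMb>m <jvmArgs...>" as a child process and returns
     * its combined stdout/stderr. A non-zero exit status (for example after
     * an OutOfMemoryError in the child) is reported as an IOException.
     */
    static String runInChildJvm(int heapMb, List<String> jvmArgs)
            throws IOException, InterruptedException {
        List<String> command = new ArrayList<>();
        command.add(javaCommand());
        command.add("-Xmx" + heapMb + "m");
        command.addAll(jvmArgs);

        Process child = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        // Drain the output fully so the child can never block on a full pipe.
        String output = new String(
                child.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = child.waitFor();
        if (exit != 0) {
            throw new IOException("Child JVM exited with status " + exit);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        try {
            // "ParseJob" and "app.jar" are hypothetical placeholders for
            // whatever performs the actual parsing work.
            String text = runInChildJvm(
                    32, List.of("-cp", "app.jar", "ParseJob", "large.xls"));
            System.out.println(text);
        } catch (IOException e) {
            // The parent JVM and its other threads keep running normally.
            System.err.println("Parsing failed: " + e.getMessage());
        }
    }
}
```

The point of the design is that the memory limit belongs to the child alone, so
the caller only ever has to handle a checked exception.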
> Out-of-process text extraction
> ------------------------------
>
> Key: TIKA-416
> URL: https://issues.apache.org/jira/browse/TIKA-416
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.9
>
>
> There's currently no easy way to guard against JVM crashes or excessive
> memory or CPU use caused by parsing very large, broken or intentionally
> malicious input documents. To better protect against such cases and to
> generally improve the manageability of resource consumption by Tika it would
> be great if we had a way to run Tika parsers in separate JVM processes. This
> could be handled either as a separate "Tika parser daemon" or as an
> explicitly managed pool of forked JVMs.