[ 
https://issues.apache.org/jira/browse/TIKA-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208100#comment-14208100
 ] 

Tim Allison commented on TIKA-1471:
-----------------------------------

Ah, thank you for sharing this use case. The first step for tika-batch is disk 
to disk, but if there are other common use cases, we should add those (a more 
robust tika-server, for example).  I've found that running Tika in its own JVM 
(despite the added storage) is the most robust way to handle large batches of 
potentially dangerous files; keep Tika in a separate JVM from the indexer or 
the next step in processing.
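
As a rough illustration of the separate-JVM idea, here is a minimal sketch that forks a child JVM per file with a capped heap and a timeout, so an OOM or hang kills only the child. This is not tika-batch itself. The class name IsolatedTikaRunner, the method names, and the jar path, heap size, and timeout values are all assumptions for illustration. The only Tika-specific piece is the tika-app "--text" option.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs tika-app in a child JVM so that an OOM or hang
// on a corrupt file cannot take down the calling process.
public class IsolatedTikaRunner {

    /** Builds the command line for a child JVM that runs tika-app on one file. */
    static List<String> buildCommand(String tikaAppJar, Path input, String maxHeap) {
        List<String> cmd = new ArrayList<>();
        // Reuse the java binary of the current JVM.
        cmd.add(Paths.get(System.getProperty("java.home"), "bin", "java").toString());
        cmd.add("-Xmx" + maxHeap);   // cap the child's heap (value is an assumption)
        cmd.add("-jar");
        cmd.add(tikaAppJar);         // path to the tika-app jar (assumed to exist)
        cmd.add("--text");           // tika-app option: output plain text content
        cmd.add(input.toString());
        return cmd;
    }

    /**
     * Extracts text from input into output (disk to disk).
     * Returns true only if the child exits with status 0 within the timeout.
     */
    static boolean extract(String tikaAppJar, Path input, Path output,
                           String maxHeap, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(tikaAppJar, input, maxHeap));
        pb.redirectOutput(output.toFile());                 // extracted text goes to disk
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // parser errors stay visible
        Process child = pb.start();
        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            child.destroyForcibly();                        // hung parser: kill the child only
            return false;
        }
        return child.exitValue() == 0;                      // non-zero covers OOM and crashes
    }
}
```

A caller might invoke IsolatedTikaRunner.extract("tika-app.jar", Paths.get("in.pdf"), Paths.get("out.txt"), "512m", 60) and, on a false return, log the file and move on to the next one. Starting a JVM per file is slow, so a real batch runner would more likely keep one long-lived child and restart it on failure.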

Right, I had forgotten to mention memory leaks as one of the things integrators 
have to deal with.  Thank you.

I wonder if PDFBOX-2200/TIKA-1424 is the culprit for the memory leak you 
mention.

> OOM with corrupt PDF file
> -------------------------
>
>                 Key: TIKA-1471
>                 URL: https://issues.apache.org/jira/browse/TIKA-1471
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.6
>         Environment: Linux, JVM 1.8.0_25-b17, 64-bit
>            Reporter: Alan Burlison
>            Priority: Blocker
>             Fix For: 1.7
>
>
> Use of PDFBox 1.8.6 by Tika 1.6 is causing OOM errors with corrupt PDF files, 
> due to a bug in PDFBox, see PDFBOX-2493. This makes Tika 1.6 unusable from 
> inside a long-running webapp and I've had to revert to Tika 1.5. Although 1.5 
> also throws errors with the corrupt file it does not cause OOM errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
