[jira] [Commented] (TIKA-1471) OOM with corrupt PDF file

Alan Burlison (JIRA) Wed, 12 Nov 2014 10:19:14 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208383#comment-14208383
 ]


Alan Burlison commented on TIKA-1471:
-------------------------------------

Running a separate indexer JVM would be safer, but up until now I haven't had 
anything that causes fatal errors. I already have to spawn ps2ascii 
(Ghostscript) sub-processes for PostScript files, as PDFBox doesn't cope with 
some of the older ones in the corpus. The impact of that on indexing time is 
significant, so I want to do as much as possible from within the same JVM.

bq. I wonder if PDFBOX-2200/TIKA-1424 is the culprit for the memory leak you 
mention.

Adding the workaround from TIKA-1424 (calling 
org.apache.pdfbox.pdmodel.font.PDFont.clearResources) does seem to help a bit 
but I'm a bit wary about calling a static method that affects global state when 
multiple threads are running. I'm therefore just going to call it at the end of 
each index run. Runs are normally incremental, so only the initial index 
build reads the whole corpus. Although memory usage is approximately 4 GB after 
a full reindex, I can just restart the appserver if necessary.
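
A minimal sketch of that "clean up once per run" idea, for illustration only. 
The IndexRun class, its run method and its parameters are hypothetical names, 
not part of Tika or PDFBox. The cleanup is passed in as a Runnable; in the real 
indexer it would presumably be PDFont::clearResources (the TIKA-1424 
workaround). The point is that the static cleanup only runs after every worker 
thread has finished, so no thread is parsing while global state is cleared.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: names are illustrative, not from Tika or PDFBox.
public class IndexRun {

    /**
     * Indexes every path on a fixed-size thread pool, waits for all workers
     * to finish, then runs globalCleanup exactly once on the calling thread.
     */
    public static void run(List<String> paths,
                           Consumer<String> indexDocument,
                           Runnable globalCleanup,
                           int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String path : paths) {
            // A failure on one document is captured in the returned Future
            // and ignored in this sketch; a real indexer would log it.
            pool.submit(() -> indexDocument.accept(path));
        }
        pool.shutdown(); // accept no new work, let queued documents finish
        try {
            // Block until all parsing is done (assumed upper bound: 1 hour).
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("index run did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("index run interrupted", e);
        }
        // No worker is running any more, so touching global state is safe
        // with respect to this run's own threads.
        globalCleanup.run();
    }
}
```

Assumed usage, with a hypothetical indexOne method: 
IndexRun.run(paths, this::indexOne, PDFont::clearResources, 4). This only 
protects against threads started by the run itself; anything else in the 
appserver that uses PDFBox concurrently would still be affected by the static 
call.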

Thanks for the helpful hints and tips :-)

> OOM with corrupt PDF file
> -------------------------
>
>                 Key: TIKA-1471
>                 URL: https://issues.apache.org/jira/browse/TIKA-1471
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.6
>         Environment: Linux, JVM 1.8.0_25-b17, 64-bit
>            Reporter: Alan Burlison
>            Priority: Blocker
>             Fix For: 1.7
>
>
> Use of PDFBox 1.8.6 by Tika 1.6 is causing OOM errors with corrupt PDF files, 
> due to a bug in PDFBox, see PDFBOX-2493. This makes Tika 1.6 unusable from 
> inside a long-running webapp and I've had to revert to Tika 1.5. Although 1.5 
> also throws errors with the corrupt file it does not cause OOM errors.



