[ https://issues.apache.org/jira/browse/PDFBOX-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208483#comment-14208483 ]
Tim Allison commented on PDFBOX-2200:
-------------------------------------
[~alanbur] recently pointed out on TIKA-1471 that calling clearResources() in a
multithreaded environment is a bad idea. Would it make sense (shudder) to make
cmapObjects a ThreadLocal? Or is there another recommendation for what we
should do until 2.0 is released if we're running PDFBox in multiple threads in
a long-running process?
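
To make the ThreadLocal idea concrete, here is a rough, self-contained sketch of a per-thread cache. It is not PDFBox code: the class PerThreadCMapCache, its methods, and the use of String as a stand-in for a parsed CMap are all hypothetical, and it assumes Java 8 for ThreadLocal.withInitial.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a font class with a per-thread CMap cache.
// None of these names are PDFBox API; String stands in for a parsed CMap.
public class PerThreadCMapCache {

    // Each thread lazily gets its own map, so the map itself needs no locking.
    private static final ThreadLocal<Map<String, String>> CMAP_OBJECTS =
            ThreadLocal.withInitial(HashMap::new);

    // Return the calling thread's cached value, "parsing" it on a miss.
    public static String getCMap(String name) {
        return CMAP_OBJECTS.get().computeIfAbsent(name, n -> "parsed:" + n);
    }

    // Clear only the calling thread's cache; other threads are unaffected.
    public static void clearResources() {
        CMAP_OBJECTS.get().clear();
    }

    // Number of entries cached for the calling thread.
    public static int cacheSize() {
        return CMAP_OBJECTS.get().size();
    }

    public static void main(String[] args) throws InterruptedException {
        getCMap("Identity-H"); // cached for the main thread only

        Thread worker = new Thread(() -> {
            getCMap("Identity-H");
            getCMap("Identity-V");
            clearResources(); // e.g. after finishing one document
            System.out.println("worker size after clear: " + cacheSize());
        });
        worker.start();
        worker.join();

        // The worker's clear did not touch the main thread's entry.
        System.out.println("main size: " + cacheSize());
    }
}
```

With this layout, a worker can clear its own cache after each document without pulling entries out from under another thread. The trade-off is that the same CMap may be parsed once per thread, and in a thread pool each thread's cache lives until it is explicitly cleared, so the per-document clear would still be needed.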
> Memory leak with org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects
> ------------------------------------------------------------------
>
> Key: PDFBOX-2200
> URL: https://issues.apache.org/jira/browse/PDFBOX-2200
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 1.8.6, 2.0.0
> Reporter: Matthew Buckett
> Fix For: 2.0.0
>
>
> We use Tika to extract text from a large number (10,000+) of PDFs in a
> long-running JVM. After doing this for a while, we started running short of
> heap space. A heap dump shows that about 717MB of heap is retained through
> org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects and that the hashmap has
> 18001 entries.
> PDFBOX-1009 appeared to partially address this, but the symptoms are
> still present. As a workaround I'm going to manually call
> PDFont.clearResources() after indexing each document to prevent this
> happening, but it would be better if I didn't have to.