Matthew Buckett created PDFBOX-2200:
---------------------------------------
Summary: Memory leak with org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects
Key: PDFBOX-2200
URL: https://issues.apache.org/jira/browse/PDFBOX-2200
Project: PDFBox
Issue Type: Bug
Components: PDModel
Affects Versions: 1.8.6
Reporter: Matthew Buckett
We use Tika to extract text from a large number (10,000+) of PDFs in a
long-running JVM. After doing this for a while we started running short of heap
space. A heap dump shows that about 717MB of heap is retained through
org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects, and the HashMap has 18001
entries.
PDFBOX-1009 looked to partially address this, but it appears the symptoms are
still present. As a workaround I'm going to manually call
PDFont.clearResources() after indexing each document to prevent this from
happening, but it would be better if I didn't have to.
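
To illustrate the pattern, here is a minimal, self-contained sketch of a static
cache that grows without bound and of the clear-after-each-document workaround.
It does not use PDFBox. FontCacheSketch, loadCmap, cacheSize and
processDocument are hypothetical names, and it assumes every document
contributes cache keys that have not been seen before. Only clearResources()
mirrors the name of the PDFont method mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

public class FontCacheSketch {
    // Hypothetical stand-in for a static, never-evicted cache such as
    // PDFont#cmapObjects. It lives as long as the class is loaded.
    private static final Map<String, byte[]> CMAP_CACHE = new HashMap<>();

    // Returns the cached entry, creating it on first use (never evicted).
    public static byte[] loadCmap(String name) {
        return CMAP_CACHE.computeIfAbsent(name, k -> new byte[1024]);
    }

    // Drops every cached entry so the garbage collector can reclaim them.
    public static void clearResources() {
        CMAP_CACHE.clear();
    }

    public static int cacheSize() {
        return CMAP_CACHE.size();
    }

    // Simulates indexing one document that uses two fonts whose cache keys
    // are unique to that document (an assumption made for this sketch).
    public static void processDocument(int docId, boolean clearAfter) {
        try {
            loadCmap("doc" + docId + "-fontA");
            loadCmap("doc" + docId + "-fontB");
        } finally {
            // Workaround: clear in finally so a failed parse still cleans up.
            if (clearAfter) {
                clearResources();
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            processDocument(i, false);
        }
        System.out.println("without clearing: " + cacheSize()); // 2000

        clearResources();
        for (int i = 0; i < 1000; i++) {
            processDocument(i, true);
        }
        System.out.println("with clearing: " + cacheSize()); // 0
    }
}
```

In the real application the equivalent step would be calling
PDFont.clearResources() in a finally block after each document has been
handed to Tika. Clearing a shared static cache while other threads are still
parsing may have side effects that this sketch does not model.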