Manfred Schauer created PDFBOX-5479:
---------------------------------------
Summary: PDFTextStripper needs 1GB heap for a 3.6 MB pdf
Key: PDFBOX-5479
URL: https://issues.apache.org/jira/browse/PDFBOX-5479
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.26
Environment: JDK 11.0.2 on macOS 12.4
Reporter: Manfred Schauer
Attachments: heapDump.png, x.pdf
Extracting text from the attached x.pdf:
PDDocument pdDocument = PDDocument.load(new File("/tmp/x.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
stripper.getText(pdDocument);
succeeds with -Xmx1G but throws an OutOfMemoryError (OOME) with -Xmx900m.
The heap dump shows 2923 instances of TrueTypeFont; PDResources.cache contains
SoftReferences to many fonts, keyed by different COSObjects.
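
To illustrate one possible reading of the heap dump (not confirmed by the report), here is a minimal, library-free sketch of a cache that maps key objects to SoftReferences. It is not PDFBox code: the names SoftCacheSketch, Key, and Font are hypothetical stand-ins, and the assumption is only that keys compare by identity. Under that assumption, a lookup with a new key object always misses, so the same font is parsed and cached again for every distinct key.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCacheSketch {
    /** Hypothetical stand-in for an indirect-object key; equality is by identity. */
    static final class Key {
        final long objectNumber;
        Key(long objectNumber) { this.objectNumber = objectNumber; }
    }

    /** Hypothetical stand-in for a parsed font that owns some heap. */
    static final class Font {
        final byte[] data;
        Font(int sizeInBytes) { this.data = new byte[sizeInBytes]; }
    }

    private final Map<Key, SoftReference<Font>> cache = new HashMap<>();
    private int parseCount = 0;

    /** Returns the cached font for this key object, "parsing" a new one on a miss. */
    Font getFont(Key key) {
        SoftReference<Font> ref = cache.get(key);
        Font font = (ref == null) ? null : ref.get();
        if (font == null) {
            font = new Font(64 * 1024); // simulated parse
            parseCount++;
            cache.put(key, new SoftReference<>(font));
        }
        return font;
    }

    int parseCount() { return parseCount; }
    int entryCount() { return cache.size(); }

    public static void main(String[] args) {
        SoftCacheSketch sketch = new SoftCacheSketch();

        // Same key object: the second lookup is a cache hit.
        Key shared = new Key(1);
        Font first = sketch.getFont(shared);
        Font second = sketch.getFont(shared);
        System.out.println("same key reused: " + (first == second));

        // A new key object per lookup: every lookup misses and adds an entry.
        for (int i = 0; i < 5; i++) {
            sketch.getFont(new Key(2));
        }
        System.out.println("parses: " + sketch.parseCount()
                + ", entries: " + sketch.entryCount());
    }
}
```

Softly referenced objects are only guaranteed to be cleared before the JVM throws an OutOfMemoryError, so a cache like this can keep the heap close to its limit for a long time. Whether this is what happens with x.pdf would have to be checked against the actual heap dump.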