[ https://issues.apache.org/jira/browse/PDFBOX-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-5479:
---------------------------------------
Issue Type: Improvement (was: Bug)
> PDFTextStripper needs 1GB heap for a 3.6 MB pdf
> -----------------------------------------------
>
> Key: PDFBOX-5479
> URL: https://issues.apache.org/jira/browse/PDFBOX-5479
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.26
> Environment: JDK11.0.2 on MacOS 12.4
> Reporter: Manfred Schauer
> Priority: Minor
> Attachments: heapDump.png, x.pdf
>
>
> Extracting text from the attached x.pdf with
>     PDDocument pdDocument = PDDocument.load(new File("/tmp/x.pdf"));
>     PDFTextStripper stripper = new PDFTextStripper();
>     stripper.getText(pdDocument);
> succeeds with -Xmx1G but throws an OutOfMemoryError with -Xmx900m.
> The heap dump shows 2923 instances of TrueTypeFont; PDResources.cache contains
> SoftReferences to many fonts keyed by different COSObjects.
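
The heap dump observation above can be illustrated with a small sketch that uses only the JDK. It is not PDFBox code: ObjectKey, ParsedFont and FontCache are hypothetical stand-ins for COSObject, TrueTypeFont and the resource cache, and the sketch only assumes that keys are compared by identity. Under that assumption, a cache keyed by the referencing object cannot deduplicate a font that is reached through many different keys, so every key produces its own parsed instance.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** JDK-only illustration; none of these types are PDFBox classes. */
public class SoftCacheSketch {

    /** Hypothetical stand-in for an indirect object key (identity-based equality). */
    static final class ObjectKey {
        final int objectNumber;
        ObjectKey(int objectNumber) { this.objectNumber = objectNumber; }
    }

    /** Hypothetical stand-in for a parsed font that holds its decoded bytes. */
    static final class ParsedFont {
        final byte[] data;
        ParsedFont(int size) { this.data = new byte[size]; }
    }

    /** Cache of soft references, keyed by the object that referenced the font. */
    static final class FontCache {
        private final Map<ObjectKey, SoftReference<ParsedFont>> fonts = new HashMap<>();

        ParsedFont get(ObjectKey key) {
            SoftReference<ParsedFont> ref = fonts.get(key);
            return ref == null ? null : ref.get();
        }

        void put(ObjectKey key, ParsedFont font) {
            fonts.put(key, new SoftReference<>(font));
        }

        int size() { return fonts.size(); }
    }

    /** Simulates pages that each reach a font through a distinct key. */
    static int parseCount(FontCache cache, int pages) {
        int parsed = 0;
        for (int page = 0; page < pages; page++) {
            ObjectKey key = new ObjectKey(page); // new key per page, never a cache hit
            if (cache.get(key) == null) {
                cache.put(key, new ParsedFont(1024));
                parsed++;
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        FontCache cache = new FontCache();
        int parsed = parseCount(cache, 100);
        System.out.println("parsed fonts: " + parsed + ", cache entries: " + cache.size());
    }
}
```

In this sketch the number of parsed fonts equals the number of distinct keys, which matches the pattern of many TrueTypeFont instances behind differently keyed cache entries. Whether the fonts in x.pdf are actually identical copies is not stated in the report, so this is only one possible reading of the heap dump.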