[
https://issues.apache.org/jira/browse/PDFBOX-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569829#comment-17569829
]
Andreas Lehmkühler commented on PDFBOX-5479:
--------------------------------------------
Ahh, one of these extreme corner cases. We have to blame Adobe InDesign for
such an inefficient usage of XObjects. On the other hand, we might think about
our own font handling. Maybe it makes sense to bind the created font to the
font descriptor instead of the font object itself.
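
To illustrate the idea of keying the cache on the font descriptor, here is a
minimal sketch in plain Java. It is not PDFBox code: Descriptor, FontObject,
ParsedFont and DescriptorKeyedFontCache are hypothetical stand-ins, and the
sketch assumes that many font objects in the file point at the same descriptor
object, which is what the heap dump suggests but does not prove.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class DescriptorKeyedFontCache {

    /** Hypothetical stand-in for a font descriptor shared by many font objects. */
    static final class Descriptor {
        final String fontName;
        Descriptor(String fontName) { this.fontName = fontName; }
    }

    /** Hypothetical stand-in for a font object that references a descriptor. */
    static final class FontObject {
        final Descriptor descriptor;
        FontObject(Descriptor descriptor) { this.descriptor = descriptor; }
    }

    /** Hypothetical stand-in for an expensive parsed font program. */
    static final class ParsedFont {
        final String name;
        ParsedFont(String name) { this.name = name; }
    }

    // Descriptor does not override equals/hashCode, so the map is keyed by
    // object identity: one entry per distinct descriptor object.
    private final Map<Descriptor, SoftReference<ParsedFont>> cache = new HashMap<>();
    private int parseCount = 0;

    /** Returns the parsed font for this font object, parsing once per descriptor. */
    ParsedFont fontFor(FontObject font) {
        SoftReference<ParsedFont> ref = cache.get(font.descriptor);
        ParsedFont parsed = (ref == null) ? null : ref.get();
        if (parsed == null) {
            // Cache miss, or the soft reference was cleared under memory pressure.
            parsed = new ParsedFont(font.descriptor.fontName);
            parseCount++;
            cache.put(font.descriptor, new SoftReference<>(parsed));
        }
        return parsed;
    }

    int parseCount() { return parseCount; }

    public static void main(String[] args) {
        DescriptorKeyedFontCache cache = new DescriptorKeyedFontCache();
        Descriptor shared = new Descriptor("ExampleFont"); // hypothetical name
        // Hold one strong reference so the cached entry stays reachable.
        ParsedFont first = cache.fontFor(new FontObject(shared));
        for (int i = 1; i < 2923; i++) {
            cache.fontFor(new FontObject(shared));
        }
        System.out.println("parsed " + cache.parseCount()
                + " font program(s), first = " + first.name);
    }
}
```

With a cache keyed by the font object, each of the distinct font objects would
trigger its own parse. Keyed by descriptor, they would share one parsed font
as long as they really reference the same descriptor.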
> PDFTextStripper needs 1GB heap for a 3.6 MB pdf
> -----------------------------------------------
>
> Key: PDFBOX-5479
> URL: https://issues.apache.org/jira/browse/PDFBOX-5479
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.26
> Environment: JDK11.0.2 on MacOS 12.4
> Reporter: Manfred Schauer
> Priority: Minor
> Attachments: heapDump.png, x.pdf
>
>
> Extracting text from the attached x.pdf:
> PDDocument pdDocument = PDDocument.load(new File("/tmp/x.pdf"));
> PDFTextStripper stripper = new PDFTextStripper();
> stripper.getText(pdDocument);
> succeeds with -Xmx1G but throws an OutOfMemoryError (OOME) with -Xmx900m.
> The heap dump shows 2923 instances of TrueTypeFont; PDResources.cache contains
> SoftReferences to lots of fonts keyed by different COSObjects.