[
https://issues.apache.org/jira/browse/PDFBOX-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569732#comment-17569732
]
Michael Klink commented on PDFBOX-5479:
---------------------------------------
Wow, some 3000 form XObjects on page 1, many of them with their own font object,
most of which point to the same font descriptor... that adds up...
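
To illustrate why that adds up, here is a minimal, library-free sketch. It is
not PDFBox code: FontCacheSketch, FontDict, Descriptor and ParsedFont are
hypothetical stand-ins, and the sample data assumes for simplicity that all
3000 font objects share a single descriptor. It only shows that a cache keyed
by the font object parses the same font program once per object, while a cache
keyed by the shared descriptor would parse it once.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class FontCacheSketch {
    /** Hypothetical stand-in for a font descriptor holding the embedded font program. */
    static final class Descriptor {
        final byte[] fontFile;
        Descriptor(byte[] fontFile) { this.fontFile = fontFile; }
    }

    /** Hypothetical stand-in for a font object; many may share one descriptor. */
    static final class FontDict {
        final Descriptor descriptor;
        FontDict(Descriptor descriptor) { this.descriptor = descriptor; }
    }

    /** Hypothetical stand-in for a parsed font; copies the data to mimic parsing cost. */
    static final class ParsedFont {
        final byte[] tables;
        ParsedFont(Descriptor d) { this.tables = d.fontFile.clone(); }
    }

    /** Cache keyed by font object identity: one parsed font per font object. */
    static int parsedWhenKeyedByFontDict(FontDict[] fonts) {
        Map<FontDict, ParsedFont> cache = new IdentityHashMap<>();
        for (FontDict f : fonts) {
            if (!cache.containsKey(f)) {
                cache.put(f, new ParsedFont(f.descriptor));
            }
        }
        return cache.size();
    }

    /** Cache keyed by descriptor identity: one parsed font per shared descriptor. */
    static int parsedWhenKeyedByDescriptor(FontDict[] fonts) {
        Map<Descriptor, ParsedFont> cache = new IdentityHashMap<>();
        for (FontDict f : fonts) {
            if (!cache.containsKey(f.descriptor)) {
                cache.put(f.descriptor, new ParsedFont(f.descriptor));
            }
        }
        return cache.size();
    }

    /** Builds 'count' distinct font objects that all point to the same descriptor. */
    static FontDict[] sampleFonts(int count) {
        Descriptor shared = new Descriptor(new byte[1024]);
        FontDict[] fonts = new FontDict[count];
        for (int i = 0; i < count; i++) {
            fonts[i] = new FontDict(shared);
        }
        return fonts;
    }

    public static void main(String[] args) {
        FontDict[] fonts = sampleFonts(3000);
        System.out.println("keyed by font object: " + parsedWhenKeyedByFontDict(fonts));
        System.out.println("keyed by descriptor:  " + parsedWhenKeyedByDescriptor(fonts));
    }
}
```

Whether keying by descriptor would be a safe change in PDFBox itself is a
separate question; the sketch only shows the multiplication effect.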
> PDFTextStripper needs 1GB heap for a 3.6 MB pdf
> -----------------------------------------------
>
> Key: PDFBOX-5479
> URL: https://issues.apache.org/jira/browse/PDFBOX-5479
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.26
> Environment: JDK11.0.2 on MacOS 12.4
> Reporter: Manfred Schauer
> Priority: Minor
> Attachments: heapDump.png, x.pdf
>
>
> Extracting text from the attached x.pdf:
> PDDocument pdDocument = PDDocument.load(new File("/tmp/x.pdf"));
> PDFTextStripper stripper = new PDFTextStripper();
> stripper.getText(pdDocument);
> succeeds with -Xmx1G but throws OOME with -Xmx900m
> The heap dump shows 2923 instances of TrueTypeFont; PDResources.cache contains
> SoftReferences to lots of fonts keyed by different COSObjects.