[ 
https://issues.apache.org/jira/browse/PDFBOX-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569938#comment-17569938
 ] 

Manfred Schauer commented on PDFBOX-5479:
-----------------------------------------

Thanks for looking into the PDF; I'm sure it's a pathological one.

It would be nice to be able to limit the heap used somehow, even at the price
of reduced correctness. I do not understand the internals of PDFBox, but your
use of caches and SoftReferences suggests that you have put some effort into
trading memory against CPU.

In my use case, a parameter that limits the number of fonts stored per
PDDocument would probably be sufficient: throw an exception if the limit is
exceeded, and default to Integer.MAX_VALUE for those who can commit unlimited
memory. The effect would be an exception in corner cases, which is better
than an OutOfMemoryError.
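
To make the idea concrete, here is a rough sketch of such a limit. It is not
PDFBox code and does not reflect how PDFBox stores fonts internally; the names
BoundedFontCache and FontLimitExceededException are made up for illustration,
and the only assumption taken from the heap dump is that fonts are held via
SoftReferences keyed by some object.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical exception, not part of PDFBox.
class FontLimitExceededException extends RuntimeException {
    FontLimitExceededException(String message) {
        super(message);
    }
}

// Hypothetical per-document cache that refuses to grow past maxEntries.
class BoundedFontCache<K, V> {
    private final int maxEntries;
    private final Map<K, SoftReference<V>> entries = new HashMap<>();

    // Default: effectively unlimited, i.e. today's behaviour.
    BoundedFontCache() {
        this(Integer.MAX_VALUE);
    }

    BoundedFontCache(int maxEntries) {
        if (maxEntries < 0) {
            throw new IllegalArgumentException("maxEntries must be >= 0");
        }
        this.maxEntries = maxEntries;
    }

    // Returns the cached value, or null if it is absent or was collected.
    V get(K key) {
        SoftReference<V> ref = entries.get(key);
        return ref == null ? null : ref.get();
    }

    // Stores a value; fails fast instead of letting the heap fill up.
    void put(K key, V value) {
        // Replacing an existing key does not add an entry, so it is allowed.
        // Keys whose referent was already collected still count towards the
        // limit: the limit is on distinct fonts seen, not on live objects.
        if (!entries.containsKey(key) && entries.size() >= maxEntries) {
            throw new FontLimitExceededException(
                    "more than " + maxEntries + " fonts in one document");
        }
        entries.put(key, new SoftReference<>(value));
    }

    int size() {
        return entries.size();
    }
}
```

A caller on a small heap would construct the cache with a low limit, for
example new BoundedFontCache<>(500), and treat FontLimitExceededException as
"document too complex" rather than crashing the JVM.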

But to be honest, I do not really understand the root cause in this PDF ...

> PDFTextStripper needs 1GB heap for a 3.6 MB pdf
> -----------------------------------------------
>
>                 Key: PDFBOX-5479
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5479
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.26
>         Environment: JDK11.0.2 on MacOS 12.4
>            Reporter: Manfred Schauer
>            Priority: Minor
>         Attachments: heapDump.png, x.pdf
>
>
> Extracting text from the attached x.pdf:
>
>     PDDocument pdDocument = PDDocument.load(new File("/tmp/x.pdf"));
>     PDFTextStripper stripper = new PDFTextStripper();
>     stripper.getText(pdDocument);
>
> succeeds with -Xmx1G but throws an OutOfMemoryError with -Xmx900m.
> The heap dump shows 2923 instances of TrueTypeFont; PDResources.cache
> contains SoftReferences to lots of fonts keyed by different COSObjects.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
