[
https://issues.apache.org/jira/browse/PDFBOX-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934483#action_12934483
]
Martijn Brinkers commented on PDFBOX-899:
-----------------------------------------
I don't think the OOM is caused by a leak. The OOM happens because the PDF
contains a large number of fonts and the font cache has no upper limit. I think
the font cache should have a sane upper limit and stop caching fonts once it
already holds the maximum number. I have added a patch to set an upper limit.
I'm not sure what the best default should be, so I have used 100. The upper
limit can be set using the system property -Dpdfontfactory=123.
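To illustrate the idea (this is not the attached patch, just a minimal sketch), a cache that stops accepting new entries once it is full could look roughly like this. The class BoundedFontCache and its methods are hypothetical names; only the property name pdfontfactory and the default of 100 come from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bounded cache: once full, new entries are simply not stored. */
final class BoundedFontCache<K, V> {
    private final int maxSize;
    private final Map<K, V> cache = new HashMap<K, V>();

    BoundedFontCache(int maxSize) {
        this.maxSize = maxSize;
    }

    /** Reads the limit from -Dpdfontfactory=N, falling back to 100. */
    static int maxSizeFromProperty() {
        // Integer.getInteger returns the default when the property is
        // missing or is not a valid integer.
        return Integer.getInteger("pdfontfactory", 100);
    }

    synchronized V get(K key) {
        return cache.get(key);
    }

    /** Stores the value only if the key is already cached or there is room. */
    synchronized void put(K key, V value) {
        if (cache.containsKey(key) || cache.size() < maxSize) {
            cache.put(key, value);
        }
    }

    synchronized int size() {
        return cache.size();
    }
}
```

A caller would create it with new BoundedFontCache<String, Object>(BoundedFontCache.maxSizeFromProperty()) and rebuild the font whenever get() returns null.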
Because the fonts are only cached, I think the only downside of not caching is
that parsing will be slower once the cache is full. Instead of setting a hard
upper limit, it might be nicer to use some kind of cache that tracks which
fonts were used most recently and evicts the ones that are no longer used.
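One common way to get that behaviour in plain Java is a LinkedHashMap in access order that evicts its eldest entry when it grows past a limit. The sketch below assumes a hypothetical class name LruFontCache and is not something PDFBox is known to ship.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache: evicts the least recently used entry when full. */
final class LruFontCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxSize;

    LruFontCache(int maxSize) {
        // accessOrder = true makes get() move an entry to the "newest" end.
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the eldest entry.
        return size() > maxSize;
    }
}
```

LinkedHashMap is not synchronized, and in access order even get() modifies the map, so a cache shared between threads would need to be wrapped, for example with Collections.synchronizedMap.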
> OutOfMemoryError with PDFTextStripper
> -------------------------------------
>
> Key: PDFBOX-899
> URL: https://issues.apache.org/jira/browse/PDFBOX-899
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.3.1
> Environment: java version "1.6.0_22"
> Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode)
> Reporter: Alexander Veit
> Priority: Critical
> Attachments: PDFBOX-899.patch
>
>
> PDFBox 1.3.1 has high memory demands when stripping text from PDF files.
> http://www.unicode.org/Public/5.1.0/charts/CodeCharts.pdf even crashes an
> application server by requiring an estimated additional 300MB+ of heap memory. The
> heap dump suggests that PDFStreamEngine#documentFontCache might be the root
> of the leaking objects.
> PDFBox 1.0.0 did not show this behaviour.