[ 
https://issues.apache.org/jira/browse/PDFBOX-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934483#action_12934483
 ] 

Martijn Brinkers commented on PDFBOX-899:
-----------------------------------------

I don't think the OOM is caused by a leak. The OOM happens because the PDF 
contains a large number of fonts and the font cache has no upper limit. I 
think the font cache should have a sane upper limit and stop caching fonts 
once it already holds the maximum number. I have added a patch that sets an 
upper limit. I'm not sure what the best default is, so I have used 100. The 
upper limit can be set using the system property -Dpdfontfactory=123.
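
To illustrate the idea, here is a minimal sketch of a cache that simply stops 
accepting new entries once it is full. It is not the attached patch: the 
class BoundedFontCache and its methods are hypothetical, fonts are stood in 
for by plain Objects, and only the property name and the default of 100 come 
from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration only; not part of PDFBox or the patch.
class BoundedFontCache {
    // Default limit mentioned in the comment above.
    static final int DEFAULT_MAX = 100;

    private final int maxSize;
    private final Map<String, Object> fonts = new HashMap<String, Object>();

    // Reads the limit from -Dpdfontfactory=<n>, falling back to the default.
    BoundedFontCache() {
        this(Integer.getInteger("pdfontfactory", DEFAULT_MAX));
    }

    BoundedFontCache(int maxSize) {
        this.maxSize = maxSize;
    }

    // Returns the cached font, or null if it was never cached.
    Object get(String key) {
        return fonts.get(key);
    }

    // Stores the font if there is room (or the key is already present).
    // Returns false when the cache is full and the font was not stored.
    boolean put(String key, Object font) {
        if (fonts.containsKey(key) || fonts.size() < maxSize) {
            fonts.put(key, font);
            return true;
        }
        return false;
    }

    int size() {
        return fonts.size();
    }
}
```

A caller that gets null from get() and false from put() would just parse the 
font again next time, which costs time but keeps memory use bounded.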

Because the fonts are only cached, I think the only downside of not caching is 
that parsing will be slower once the cache is full. Instead of setting a 
fixed upper limit, it might be nicer to use a cache that tracks which fonts 
were used most recently and removes the ones that are no longer used.
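
One way such a cache could look, as a rough sketch only, is a 
least-recently-used (LRU) map built on java.util.LinkedHashMap in access 
order. The class name LruFontCache and the generic key/value types are 
assumptions for illustration; nothing here is taken from PDFBox or the 
attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class for illustration only; not part of PDFBox.
class LruFontCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    LruFontCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most
        // recently accessed entry, so the "eldest" entry is the LRU one.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts
    // the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

With a limit of 2, putting F1 and F2, reading F1, and then putting F3 would 
evict F2, because F1 was touched more recently. Note that LinkedHashMap is 
not thread-safe, so shared use would need external synchronization.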

> OutOfMemoryError with PDFTextStripper
> -------------------------------------
>
>                 Key: PDFBOX-899
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-899
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1
>         Environment: java version "1.6.0_22"
> Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode)
>            Reporter: Alexander Veit
>            Priority: Critical
>         Attachments: PDFBOX-899.patch
>
>
> PDFBox 1.3.1 has high memory demands when stripping text from PDF files.
> http://www.unicode.org/Public/5.1.0/charts/CodeCharts.pdf even crashes an 
> application server by requiring an estimated additional 300MB+ of heap 
> memory. The heap dump suggests that PDFStreamEngine#documentFontCache might 
> be the root of the leaking objects.
> PDFBox 1.0.0 did not show this behaviour. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
