[ https://issues.apache.org/jira/browse/PDFBOX-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082318#comment-16082318 ]
Nicholas DiPiazza commented on PDFBOX-3856:
-------------------------------------------
The client's PDF that caused this was not available to me, and I couldn't
reproduce the problem with Google Sheets myself. Something must have been
special about my client's PDF, but it was confidential, so I could not share it.
But the real problem for us was that we were converting a spreadsheet to PDF
using Google Drive in the first place. We really should have exported to XLSX
instead, which Tika handles with a different parser, and that works fine.
> Non-large PDF's can cause Out of Memory Exceptions
> --------------------------------------------------
>
> Key: PDFBOX-3856
> URL: https://issues.apache.org/jira/browse/PDFBOX-3856
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.1
> Reporter: Nicholas DiPiazza
> Priority: Blocker
> Attachments: Pasted image at 2017_07_05 02_26 PM.png
>
>
> Tika version: 1.13
> PDFBox Version: 2.0.1
> We run an application that makes PDFs searchable using Apache Tika, which
> in turn uses PDFBox to parse each PDF and extract its body as text.
> We accept basically any PDF from anywhere, as long as it isn't too large
> (9 MB).
> However, we are noticing that some PDFs, even though they are not that large
> in file size, can act like zip bombs: they eat up all the heap space and
> crash the JVM.
> There is some sort of Object[] array that holds millions of
> {code}org.apache.pdfbox.text.TextPosition{code}
> instances.
> Here is a snapshot of the heapdump:
> https://issues.apache.org/jira/secure/attachment/12875808/Pasted%20image%20at%202017_07_05%2002_26%20PM.png
> Is there a setting to limit the size of this particular array so that it
> doesn't cause a memory bomb?
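The thread does not say whether PDFBox offers such a setting. As a rough sketch of the kind of guard being asked about, the class below caps how many items a collector will hold and fails fast once the cap is reached. `BoundedCollector` and its limit are hypothetical names for illustration, not part of PDFBox or Tika. Wiring it into text extraction (for example by counting calls in an overridden `PDFTextStripper.processTextPosition(TextPosition)`) is an assumption that is not shown or tested here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical guard (not a PDFBox or Tika class): collects items but
// refuses to grow past a fixed cap, so a pathological input fails fast
// instead of exhausting the heap.
final class BoundedCollector<T> {
    private final int maxItems;
    private final List<T> items = new ArrayList<>();

    BoundedCollector(int maxItems) {
        if (maxItems < 0) {
            throw new IllegalArgumentException("maxItems must be >= 0");
        }
        this.maxItems = maxItems;
    }

    // Adds one item, or throws once the cap has been reached.
    void add(T item) {
        if (items.size() >= maxItems) {
            throw new IllegalStateException("Item limit exceeded: " + maxItems);
        }
        items.add(item);
    }

    int size() {
        return items.size();
    }
}
```

A caller would catch the `IllegalStateException` and treat the document as unparseable rather than letting the JVM run out of memory. The right cap depends on heap size and expected document sizes, so it would need to be chosen empirically.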