[jira] [Comment Edited] (TIKA-3642) Getting java.lang.OutOfMemoryError: Java heap space when parsing PDF file

Tilman Hausherr (Jira) Mon, 10 Jan 2022 19:58:05 -0800


    [ https://issues.apache.org/jira/browse/TIKA-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472441#comment-17472441 ]


Tilman Hausherr edited comment on TIKA-3642 at 1/11/22, 3:57 AM:
-----------------------------------------------------------------

There is no "data truncation". Also, the parameter mentioned isn't perfect: it 
relates to writing streams to disk instead of keeping them in memory. Complex 
PDFs might still cause trouble if they have non-stream complexity, e.g. in 
their structure tree. You need to accept that Tika will sometimes fail on 
individual PDFs. Make sure you have enough memory on your system.
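
One way to live with occasional failures on individual PDFs is to let a batch job record the failing file and move on. The following is only a minimal sketch under stated assumptions: `TolerantBatch`, `parseAll`, and the stand-in parser are hypothetical names, not Tika or PDFBox API, and the `OutOfMemoryError` is simulated. Catching `OutOfMemoryError` in-process is not fully reliable, since the JVM may be left in a degraded state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: not part of Tika or PDFBox.
final class TolerantBatch {

    /** Outcome of a batch run: extracted texts plus the inputs that failed. */
    static final class Result {
        final List<String> texts = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
    }

    /**
     * Applies the parser to every path. A failure on one input (including an
     * OutOfMemoryError) is recorded and the loop continues with the next input.
     */
    static Result parseAll(List<String> paths, Function<String, String> parser) {
        Result result = new Result();
        for (String path : paths) {
            try {
                result.texts.add(parser.apply(path));
            } catch (OutOfMemoryError | RuntimeException e) {
                // Accept the failure for this document only.
                result.failed.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for a real parser call; it simulates a heap failure on one file.
        Function<String, String> fakeParser = path -> {
            if (path.contains("huge")) {
                throw new OutOfMemoryError("Java heap space (simulated)");
            }
            return "text of " + path;
        };
        Result r = parseAll(Arrays.asList("a.pdf", "huge.pdf", "b.pdf"), fakeParser);
        System.out.println("parsed=" + r.texts.size() + " failed=" + r.failed);
    }
}
```

Running each document in a separate JVM is likely the more robust variant of the same idea, because a heap failure then cannot affect the other documents.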


was (Author: tilman):
There is no "data truncation". Also, the parameter mentioned isn't perfect: it 
relates to writing streams to disk. Complex PDFs might still cause trouble if 
they have non-stream complexity, e.g. in their structure tree. You need to 
accept that Tika will sometimes fail on individual PDFs.

> Getting java.lang.OutOfMemoryError: Java heap space when parsing PDF file
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3642
>                 URL: https://issues.apache.org/jira/browse/TIKA-3642
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tika User
>            Priority: Major
>
> When parsing large PDF files (1.65 GB), we get an out-of-memory error. The 
> PDFBox version we are using is 2.0.25.
> java.lang.OutOfMemoryError: Java heap space at 
> org.apache.pdfbox.pdfparser.COSParser.isString



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
