[
https://issues.apache.org/jira/browse/TIKA-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474997#comment-17474997
]
Tim Allison commented on TIKA-3642:
-----------------------------------
Got your file. Thank you. That was critical. What's going on is that in
tika-1.x we default maxMainMemoryBytes to 512MB. In tika-2.x, the default
is -1, which effectively removes that limit. This is *bad*, and we should
fix this quickly.
I was able to parse the file without a problem in 1.x with -Xmx1g, and I got
the same behavior in 2.x when I used the config below. Without that config,
2.x hit an OOM even with -Xmx2g (I didn't try higher).
{noformat}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="maxMainMemoryBytes" type="long">524288000</param>
      </params>
    </parser>
  </parsers>
</properties>
{noformat}
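As a quick sanity check on the value in that config, here is a minimal, standard-library-only Java sketch that reads the maxMainMemoryBytes param back out of the XML and converts it to MiB. It does not use Tika at all; the class name TikaConfigCheck, the helper readLongParam, and the "return -1 when the param is missing" convention are assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TikaConfigCheck {

    // The config from the comment above, inlined so the sketch is self-contained.
    static final String CONFIG =
        "<properties><parsers>"
        + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
        + "<parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>"
        + "</parser>"
        + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
        + "<params><param name=\"maxMainMemoryBytes\" type=\"long\">524288000</param></params>"
        + "</parser>"
        + "</parsers></properties>";

    // Hypothetical helper: returns the numeric value of the first <param> with
    // the given name, or -1 if no such param exists (a convention of this sketch).
    static long readLongParam(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList params = doc.getElementsByTagName("param");
            for (int i = 0; i < params.getLength(); i++) {
                Element p = (Element) params.item(i);
                if (name.equals(p.getAttribute("name"))) {
                    return Long.parseLong(p.getTextContent().trim());
                }
            }
            return -1L;
        } catch (Exception e) {
            // Malformed XML or a non-numeric value ends up here.
            throw new IllegalStateException("could not read param " + name, e);
        }
    }

    public static void main(String[] args) {
        long bytes = readLongParam(CONFIG, "maxMainMemoryBytes");
        // 524288000 / (1024 * 1024) = 500
        System.out.println(bytes + " bytes = " + bytes / (1024 * 1024) + " MiB");
    }
}
```

Running it shows that 524288000 bytes is exactly 500 MiB, so the configured limit sits slightly under the 512MB figure mentioned above while still fitting comfortably inside a -Xmx1g heap.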
> Getting java.lang.OutOfMemoryError: Java heap space when parsing PDF file
> -------------------------------------------------------------------------
>
> Key: TIKA-3642
> URL: https://issues.apache.org/jira/browse/TIKA-3642
> Project: Tika
> Issue Type: Bug
> Reporter: Tika User
> Priority: Major
>
> When parsing large PDF files (1.65 GB), we get an out-of-memory error. The
> version we are using is 2.0.25 (pdfbox).
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.pdfbox.pdfparser.COSParser.isString