[jira] [Commented] (TIKA-3642) Getting java.lang.OutOfMemoryError: Java heap space when parsing PDF file

Tika User (Jira) Wed, 12 Jan 2022 08:05:07 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474650#comment-17474650
 ]


Tika User commented on TIKA-3642:
---------------------------------

I tried using setMaxMainMemoryBytes but am still seeing memory issues. With 
Tika 1.27, the same file shows no memory issue and no infinite loop. We are 
worried about the infinite loop because it would be a problem if it occurred 
in production, and we are seeing it only after the latest upgrade to 2.2.1. 
Can you please suggest a safe way to handle this infinite loop? I tried 
ForkParser, but it affects our code in many places: we usually point to a 
config.xml and use the auto-detect parser with it, and we do not see an option 
to pass that config to ForkParser. I tried the code below. Alternative 
solutions are much appreciated, at least so that we can handle the infinite 
loop from our side. 

// Init fork parser
forkParser = new ForkParser();
List<String> javaArgs = new ArrayList<>();
javaArgs.add("java");
javaArgs.add("-Xmx3048m"); // Maximum heap size for the forked parser JVM
forkParser.setJavaCommand(javaArgs);
forkParser.setPoolSize(1);

try (FileInputStream inputData = new FileInputStream(path)) {
    config = TikaConfigFactory.getTikaConfig();
    Parser autoDetectParser = new AutoDetectParser(config);
    ParseContext context = new ParseContext();
    context.set(TikaConfig.class, config);
    if (!largefile) {
        // Regular files are parsed in-process
        autoDetectParser.parse(inputData, handler, metadata, context);
    } else {
        // Large files are parsed in the forked JVM
        forkParser.parse(inputData, handler, metadata, context);
    }
}

Can we use ForkParser the same way as AutoDetectParser, by passing the config 
to its constructor?
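
As a fallback for the infinite loop on our side, independent of ForkParser, 
we are also considering running the parse on a worker thread with a time 
limit. Below is a minimal sketch using only java.util.concurrent. The class 
name TimedParse and the method runWithTimeout are our own hypothetical names, 
not part of Tika. Note the limit of this approach: cancelling the task only 
interrupts the worker thread, so a loop that never checks for interruption 
keeps running (and keeps its memory) in the background. The caller is 
unblocked, but the work is not actually stopped, which is why a separate 
process is still the safer option.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a Tika class): bounds how long the caller waits.
class TimedParse {

    // Runs the task on a daemon worker thread and waits at most timeoutMillis.
    // Throws TimeoutException if the task has not finished by then.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Only interrupts the worker; code that ignores interrupts keeps running.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Surface the task's own exception where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e; // e.g. the task died with an Error such as OutOfMemoryError
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In our code the task would wrap the existing parse call, for example 
() -> { autoDetectParser.parse(inputData, handler, metadata, context); return null; }, 
and a TimeoutException would be treated as a failed document.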

> Getting java.lang.OutOfMemoryError: Java heap space when parsing PDF file
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3642
>                 URL: https://issues.apache.org/jira/browse/TIKA-3642
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tika User
>            Priority: Major
>
> When parsing large PDF files (1.65 GB), we get an out-of-memory error. The 
> PDFBox version we are using is 2.0.25.
> java.lang.OutOfMemoryError: Java heap space at 
> org.apache.pdfbox.pdfparser.COSParser.isString



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
