[jira] [Commented] (TIKA-2848) This file consumes an inordinate amount of memory when parsed by Tika

Tim Barrett (JIRA) Tue, 09 Apr 2019 07:21:07 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813479#comment-16813479
 ]


Tim Barrett commented on TIKA-2848:
-----------------------------------

I'm using tika-app-1.20.jar. I was using -Xmx4g; that hangs consistently once 
old gen is full, with the CPU 100% occupied by GC. Strangely, when using -Xmx512m 
as you did, the GC that runs when old gen is full takes a couple of seconds to 
complete, after which parsing completes and memory behaviour is normal. With 4g 
the GC step takes 3.5 minutes.
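
To put numbers on the GC step at different -Xmx settings, the time the JVM 
spends in collection can be read from the standard java.lang.management beans. 
The following is only a minimal sketch, not anything taken from Tika: the class 
name GcProbe and its methods are hypothetical, and the workload in main is a 
placeholder that would be replaced by the real parse call.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of Tika): reports wall-clock time and
// cumulative GC time spent while a task runs.
public class GcProbe {

    // Sum of the approximate accumulated collection time (ms) over all collectors.
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Heap currently in use, in bytes.
    public static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Runs the task and returns { wall-clock ms, GC ms accumulated during the task }.
    public static long[] measure(Runnable task) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long wallMillis = (System.nanoTime() - start) / 1_000_000;
        long gcMillis = totalGcMillis() - gcBefore;
        return new long[] { wallMillis, gcMillis };
    }

    public static void main(String[] args) {
        long[] result = measure(() -> {
            // Placeholder workload; substitute the actual parse of the PDF here.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("wall ms=" + result[0]
                + " gc ms=" + result[1]
                + " used heap bytes=" + usedHeapBytes());
    }
}
```

Running the same task under -Xmx512m and -Xmx4g and comparing the two figures 
would show how much of the elapsed time is GC rather than parsing.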

Our code uses the autodetect parser and our own bodyContentHandler. We need to 
allocate more than 512m (typically 5G) as we parse large numbers of files 
(potentially millions).

Thanks for the help so far. Could you send me a link to get hold of the version 
21 snapshot jar? I can try that as well.

> This file consumes an inordinate amount of memory when parsed by Tika
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2848
>                 URL: https://issues.apache.org/jira/browse/TIKA-2848
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Barrett
>            Priority: Major
>         Attachments: Yearbook_1997_r.pdf, Yearbook_2013_s.pdf
>
>
> When this document is parsed by Tika upwards of 4 Gigs of JVM memory is used. 
> With 5 Gigs allocated all of the memory is used and an inordinate amount of 
> time is spent garbage collecting. These are quite old PDFs that were created 
> by a Canon OCR scanner. This can easily be reproduced by using the CLI 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
