[ https://issues.apache.org/jira/browse/TIKA-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814434#comment-16814434 ]
Tim Allison commented on TIKA-2848:
-----------------------------------

The next version of PDFBox should be out today or tomorrow. I'll send a link to the build that integrates it. I think we've gotten far enough to determine that you might need help from the PDFBox team. You may want to ask on their user list with reference to this issue.

> This file consumes an inordinate amount of memory when parsed by Tika
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2848
>                 URL: https://issues.apache.org/jira/browse/TIKA-2848
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Barrett
>            Priority: Major
>         Attachments: Screenshot 2019-04-09 at 16.11.29.png, Yearbook_1997_r.pdf, Yearbook_2013_s.pdf
>
>
> When this document is parsed by Tika, upwards of 4 Gigs of JVM memory is used.
> With 5 Gigs allocated, all of the memory is used and an inordinate amount of
> time is spent garbage collecting. These are quite old PDFs that were created
> by a Canon OCR scanner. This can easily be reproduced by using the CLI.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)