[ https://issues.apache.org/jira/browse/TIKA-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812870#comment-16812870 ]
Tim Allison commented on TIKA-2848:
-----------------------------------
Hmmm... I'm able to extract text from both with straight PDFBox ExtractText
2.0.15-rc1 with -Xmx512m, which is actually ok (to me) given that they are each
hundreds of pages.
I'm also able to extract text with -Xmx512m with tika-app: {{java -jar
tika-app-1.21-SNAPSHOT.jar Yearbook...pdf}} on both files. There is quite a
bit of garbage collection on the earlier one, but it still parses in under a
minute.
tika-app from the command line streams its output, whereas the tika-app GUI
caches the data... perhaps that's the problem?
How are you calling Tika in your application?
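
To illustrate the streaming-versus-caching distinction, here is a minimal sketch using only the JDK's SAX classes. Tika hands extracted text to a SAX ContentHandler, so the two handlers below mimic the two styles of caller. This is not Tika's own code: the class names, the emitPages helper and the page text are all hypothetical stand-ins for a real parser.

```java
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StreamingVsCaching {

    // Hypothetical page content; a real parser would emit extracted PDF text.
    static final char[] PAGE = "text of one page\n".toCharArray();

    /** Forwards each chunk of text to a Writer as soon as it arrives. */
    static class StreamingHandler extends DefaultHandler {
        private final Writer out;

        StreamingHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.write(ch, start, length);
            } catch (java.io.IOException e) {
                throw new SAXException(e);
            }
        }
    }

    /** Keeps all text on the heap until the caller asks for it. */
    static class CachingHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length);
        }

        int cachedChars() {
            return buffer.length();
        }
    }

    /** A Writer that only counts characters, standing in for stdout or a file. */
    static class CountingWriter extends Writer {
        long count;

        @Override
        public void write(char[] cbuf, int off, int len) {
            count += len;
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }
    }

    /** Simulates a parser that emits one text event per page. */
    static void emitPages(DefaultHandler handler, int pages) {
        try {
            for (int i = 0; i < pages; i++) {
                handler.characters(PAGE, 0, PAGE.length);
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CountingWriter sink = new CountingWriter();
        emitPages(new StreamingHandler(sink), 300);

        CachingHandler caching = new CachingHandler();
        emitPages(caching, 300);

        System.out.println("streamed chars (none retained): " + sink.count);
        System.out.println("chars held in memory by caching handler: " + caching.cachedChars());
    }
}
```

The streaming handler retains nothing between events, so its memory use does not grow with page count. The caching handler's buffer grows with every page, which is the kind of behavior worth checking for in the calling application.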
> This file consumes an inordinate amount of memory when parsed by Tika
> ---------------------------------------------------------------------
>
> Key: TIKA-2848
> URL: https://issues.apache.org/jira/browse/TIKA-2848
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Barrett
> Priority: Major
> Attachments: Yearbook_1997_r.pdf, Yearbook_2013_s.pdf
>
>
> When this document is parsed by Tika, upwards of 4 GB of JVM memory is used.
> With 5 GB allocated, all of the memory is used and an inordinate amount of
> time is spent garbage collecting. These are quite old PDFs that were created
> by a Canon OCR scanner. This can easily be reproduced by using the CLI