[
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902528#comment-14902528
]
Tim Allison edited comment on TIKA-1737 at 9/22/15 4:16 PM:
------------------------------------------------------------
bq. there were many more that just had a single line of error
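Try adding {{-XX:-OmitStackTraceInFastThrow}} to your JVM invocation. The
single-line errors might be a HotSpot optimization: once the same implicit
exception (e.g. a NullPointerException) has been thrown many times from
compiled code, the JVM can rethrow a preallocated exception that carries no
stack trace.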
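To check whether the flag actually reached the JVM, and to spot exceptions that arrived without a trace, something like the following might help. It is only a sketch: the class and method names are made up for illustration, it assumes that the last occurrence of a repeated flag wins, and it only sees the arguments that the runtime MXBean reports.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class FastThrowCheck {
    // Flag spellings as passed on the java command line.
    static final String DISABLE_FLAG = "-XX:-OmitStackTraceInFastThrow";
    static final String ENABLE_FLAG = "-XX:+OmitStackTraceInFastThrow";

    // True if the given JVM arguments ask for full stack traces.
    // Assumption: when the flag is repeated, the last occurrence wins.
    static boolean fullTracesRequested(List<String> jvmArgs) {
        boolean requested = false;
        for (String arg : jvmArgs) {
            if (arg.equals(DISABLE_FLAG)) {
                requested = true;
            } else if (arg.equals(ENABLE_FLAG)) {
                requested = false;
            }
        }
        return requested;
    }

    // True if the throwable carries no stack frames, which is what a
    // "single line of error" in a log usually corresponds to.
    static boolean hasEmptyTrace(Throwable t) {
        return t.getStackTrace().length == 0;
    }

    public static void main(String[] args) {
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Full stack traces requested: "
                + fullTracesRequested(jvmArgs));
    }
}
```

Calling {{hasEmptyTrace}} in the catch block around the parse call would show how many of the one-line errors really lack a trace.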
bq. the real issue are the horrendous memory leaks caused whenever a PDFBox
exception is thrown, that's definitely got worse
Have you profiled this to confirm that the memory leaks are caused by the
exceptions being thrown? That's interesting...
was (Author: [email protected]):
bq. there were many more that just had a single line of error
Try adding this to your jvm invocation
{{-JXX:-OmitStackTraceInFastThrow}}...this might be a Java optimization.
bq. the real issue are the horrendous memory leaks caused whenever a PDFBox
exception is thrown, that's definitely got worse
Have you done the profiling to determine the memory leaks are caused by
exceptions being thrown? That's interesting...
> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 1.10
> Environment: Linux, Solaris
> Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather
> than PDFBox being better it's actually far, far worse. With the same corpus,
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each
> time there's an error indexing a PDF file. It's so bad I'm going to switch to
> running pdftotext (part of Xpdf) as an external process. Note that many of
> the errors in PDFBox are clearly caused by programming errors, e.g.
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and
> EOFException.
> I strongly recommend that Tika either revert to PDFBox 1.8.6 or find a
> replacement for PDFBox, as 1.8.10 just isn't fit for purpose.
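
For illustration only, the external-process workaround mentioned in the description could look roughly like the sketch below. It assumes that Xpdf's {{pdftotext}} is on the PATH and that Java 8 or later is in use. The class name, method names and timeout handling are hypothetical and are not taken from Tika or from the reporter's code.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class PdfToTextRunner {
    // Builds the command line: pdftotext -enc UTF-8 <input.pdf> <output.txt>
    static List<String> buildCommand(File pdf, File out) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8",
                pdf.getPath(), out.getPath());
    }

    // Runs pdftotext in a separate process so that a crash or leak in the
    // extractor cannot affect the indexing JVM. Returns the extracted text.
    static String extract(File pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        File out = File.createTempFile("pdftotext", ".txt");
        try {
            // inheritIO() forwards diagnostics to this process's console and
            // avoids blocking on an unread pipe.
            Process p = new ProcessBuilder(buildCommand(pdf, out))
                    .inheritIO()
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("pdftotext timed out on " + pdf);
            }
            if (p.exitValue() != 0) {
                throw new IOException("pdftotext exited with status "
                        + p.exitValue() + " on " + pdf);
            }
            return new String(Files.readAllBytes(out.toPath()),
                    StandardCharsets.UTF_8);
        } finally {
            out.delete(); // best-effort cleanup of the temporary file
        }
    }
}
```

The timeout is there because a malformed PDF could otherwise stall the indexer indefinitely. A non-zero exit status is reported as an {{IOException}} so the caller can skip the file and carry on.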