[ https://issues.apache.org/jira/browse/TIKA-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724279#comment-17724279 ]
Carey Halton commented on TIKA-4047:
------------------------------------
Thanks for testing it. Unfortunately, upgrading Tika is not a trivial process
for us because of custom code that we maintain on top of it. Do you know if
upgrading PDFBox alone would be sufficient? Have there been any breaking
changes in PDFBox since Tika 2.4.1 that would require us to cherry-pick
additional changes into our fork to accommodate the upgrade?
> Various PDF Parsing errors
> --------------------------
>
> Key: TIKA-4047
> URL: https://issues.apache.org/jira/browse/TIKA-4047
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Environment: Windows 11, using Tika server /tika/body API.
> Reporter: Carey Halton
> Priority: Minor
> Attachments: ML100500495 error.txt, ML100500495.PDF, ML100840685
> error.txt, ML100840685.pdf, ML22020A080 error.txt, ML22020A080.pdf
>
>
> We are seeing various PDF parser errors for a few specific PDF files with
> Tika 2.4.1. We were hoping that someone could help us investigate and see if
> there are bugs with the PDF parser or PDFBox that could be fixed to allow
> these to be parsed (or let us know if they are already fixed in a later
> version), or if there is just something corrupted about these particular
> files that makes parsing them impossible. I have attached the 3 files as well
> as txt files that include the exception message we are seeing for each of
> them.