Carey Halton created TIKA-4047:
----------------------------------
Summary: Various PDF Parsing errors
Key: TIKA-4047
URL: https://issues.apache.org/jira/browse/TIKA-4047
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.4.1
Environment: Windows 11, using Tika server /tika/body API.
Reporter: Carey Halton
Attachments: ML100500495 error.txt, ML100500495.PDF, ML100840685 error.txt,
ML100840685.pdf, ML22020A080 error.txt, ML22020A080.pdf
We are seeing PDF parser errors for a few specific PDF files with Tika 2.4.1.
Could someone help us determine whether these are caused by bugs in Tika's PDF
parser or in PDFBox that could be fixed (or that are already fixed in a later
version), or whether the files themselves are corrupted in a way that makes
parsing impossible? I have attached the three PDF files, along with a txt file
for each that contains the exception message we are seeing.