[
https://issues.apache.org/jira/browse/PDFBOX-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222960#comment-17222960
]
Michael Klink commented on PDFBOX-5006:
---------------------------------------
{quote}but I can open them using other pdf viewer (like chrome pdf viewer for
example)
{quote}
Please be aware that PDF viewers, and even PDF editors with GUIs, are usually
very lax about document validity.
PDF libraries, on the other hand, must not be as lax because there is no human
in control.
For example, if you have an invalid PDF which, because of some error, would
render as rubbish on the PDF recipient's computer, a GUI PDF editor may still
open and show it (doing some repairs under the hood) because the user working
with the editor can (actually *must*, it's part of their job) recognize the
rubbish, stop processing the file, and request an undamaged file from the
document source; thus, the final PDF recipient never gets to see this rubbish.
A PDF library in a fully automated workflow, though, cannot assume that anyone
verifies that the PDF displays as desired in spite of its defects. Thus, it has
to do its best to keep the final PDF recipient from seeing rubbish. And doing
its best here can only mean refusing to process broken PDFs.
IMO PDFBox already repairs too many errors under the hood.
----
That all being said, though: I downloaded those files and did not encounter any
issues in opening them with PDFBox. Are you sure your download of those files
actually succeeded?
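
One way to check this before calling PDDocument.load is to look at the first bytes of the downloaded file. The reported stack trace ends in COSParser.parseHeader, i.e. while reading the very first line, which would be consistent with an empty or truncated download (this is a guess, not a confirmed diagnosis). The following is a minimal sketch using only the JDK; the class name PdfHeaderCheck, the method looksLikePdf, and the 1024-byte search window are assumptions made for illustration and are not part of PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of PDFBox.
class PdfHeaderCheck {

    // How many leading bytes to search for the header marker.
    // The value is an assumption chosen for this sketch.
    private static final int SEARCH_WINDOW = 1024;

    /** Returns true if the marker "%PDF-" occurs within the first SEARCH_WINDOW bytes. */
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[SEARCH_WINDOW];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            // read() may return fewer bytes than requested, so loop until
            // the window is full or the end of the file is reached.
            while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
        }
        // ISO-8859-1 maps every byte to exactly one char, so binary data is safe to search.
        return new String(head, 0, n, StandardCharsets.ISO_8859_1).contains("%PDF-");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfHeaderCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + ": size=" + Files.size(file)
                + " bytes, looksLikePdf=" + looksLikePdf(file));
    }
}
```

A file of size 0, or one that starts with an HTML error page instead of the "%PDF-" marker, points to a failed download rather than to a parser problem. A positive result only means the header is present; it says nothing about the validity of the rest of the file.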
> java.io.IOException: Error: End-of-File, expected line during PDDocument.load
> -----------------------------------------------------------------------------
>
> Key: PDFBOX-5006
> URL: https://issues.apache.org/jira/browse/PDFBOX-5006
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.20, 2.0.21
> Environment: Debian, macOS, OpenJDK 12
> Reporter: Nicolas M
> Priority: Major
>
> I get an IOException when I try to open some PDFs using the library (calling
> PDDocument.load(pdfFile)). Here are some URLs of affected PDFs (I think it's
> the same problem for all of them):
> * [https://www.buerger.uni-frankfurt.de/80977779/Rehbein_Schule_Hanau_9_2018.pdf]
> * [http://www.geislerfarms.com/documents/filelibrary/Geisler_COVID_statement_0A7A094E1EFB7.pdf]
> * [http://www.sahealth.sa.gov.au/wps/wcm/connect/c736e1d5-932e-4f8a-8e56-52ab10a214fd/SALHN+Governing+Board+Minutes+-+5+March+2020.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c736e1d5-932e-4f8a-8e56-52ab10a214fd-niR9I3J]
> I think the files are not well formatted and don't respect the PDF specs but I
> can open them using other pdf viewer (like chrome pdf viewer for example)
>
> Here is the stack trace :
> {code:java}
> java.io.IOException: Error: End-of-File, expected line
>     at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1098)
>     at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2581)
>     at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:989)
> {code}