[ 
https://issues.apache.org/jira/browse/PDFBOX-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222960#comment-17222960
 ] 

Michael Klink commented on PDFBOX-5006:
---------------------------------------

{quote}but I can open them using other pdf viewer (like chrome pdf viewer for 
example)
{quote}
Please be aware that PDF viewers, and even PDF editors with GUIs, are usually 
very lax about document validity.

PDF libraries, on the other hand, cannot afford to be as lax because there is 
no human in control.

For example, suppose you have an invalid PDF which, because of some error, 
would render as rubbish on the PDF recipient's computer. A GUI PDF editor may 
still open and show it (doing some repairs under the hood) because the user 
working with the editor can (actually *must*, it's part of their job) recognize 
the rubbish, stop processing the file, and request an undamaged file from the 
document source; thus, the final PDF recipient never gets to see this rubbish. 
A PDF library in a fully automated workflow, though, cannot assume that there 
is anyone who verifies that the PDF displays as desired in spite of its 
defects. Thus, it has to do its best to prevent the final PDF recipient from 
seeing rubbish. And doing its best here can only mean refusing to process 
broken PDFs.

IMO PDFBox already repairs too many errors under the hood.

----

All that being said, though: I downloaded those files and did not encounter any 
issues opening them with PDFBox. Are you sure your download of those files 
actually succeeded?
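
One way to check the download before handing the file to PDFBox is sketched below. The stack trace in the issue ends in parsePDFHeader, which would be consistent with (though it does not prove) an empty file or a non-PDF response, such as an error page, saved under a .pdf name. The class PdfDownloadCheck, the method looksLikePdf and the 1024-byte scan limit are hypothetical names and values chosen for this illustration; they are not part of PDFBox, and the check uses only the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: a cheap sanity check on a download.
final class PdfDownloadCheck {

    // Assumption: scanning the first 1024 bytes is enough to find the header marker.
    private static final int SCAN_LIMIT = 1024;

    /** Returns true if "%PDF-" occurs within the first SCAN_LIMIT bytes of the file. */
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[SCAN_LIMIT];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // Keep reading until the buffer is full or the stream ends.
            while (total < head.length
                    && (n = in.read(head, total, head.length - total)) != -1) {
                total += n;
            }
        }
        // ISO-8859-1 maps every byte to exactly one char, so binary data decodes safely.
        String text = new String(head, 0, total, StandardCharsets.ISO_8859_1);
        return text.contains("%PDF-");
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java PdfDownloadCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + ": size=" + Files.size(file)
                + " bytes, looksLikePdf=" + looksLikePdf(file));
    }
}
```

If looksLikePdf returns false or the reported size is 0, the download most likely did not deliver a PDF, and it is worth fetching the file again before calling PDDocument.load on it. A true result only means a header marker is present; it says nothing about the validity of the rest of the file.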

> java.io.IOException: Error: End-of-File, expected line during PDDocument.load
> -----------------------------------------------------------------------------
>
>                 Key: PDFBOX-5006
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5006
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.20, 2.0.21
>         Environment: Debian, MacOs, open JDK 12
>            Reporter: Nicolas M
>            Priority: Major
>
> I get an IOException when I try to open some PDFs using the library (calling 
> PDDocument.load(pdfFile)). Here are some URLs of affected PDFs (I think it's 
> the same problem for all of them):
>  * 
> [https://www.buerger.uni-frankfurt.de/80977779/Rehbein_Schule_Hanau_9_2018.pdf]
>  * 
> [http://www.geislerfarms.com/documents/filelibrary/Geisler_COVID_statement_0A7A094E1EFB7.pdf]
>  * 
> [http://www.sahealth.sa.gov.au/wps/wcm/connect/c736e1d5-932e-4f8a-8e56-52ab10a214fd/SALHN+Governing+Board+Minutes+-+5+March+2020.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c736e1d5-932e-4f8a-8e56-52ab10a214fd-niR9I3J]
> I think the files are not well formed and don't respect the PDF spec, but I 
> can open them using other pdf viewer (like chrome pdf viewer for example)
>  
> Here is the stack trace : 
> {code:java}
> java.io.IOException: Error: End-of-File, expected line
>     at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1098)
>     at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2581)
>     at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:989)
> {code}



