[ 
https://issues.apache.org/jira/browse/PDFBOX-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223024#comment-17223024
 ] 

Michael Klink commented on PDFBOX-5006:
---------------------------------------

{quote}If I open the PDF in Chrome and save it, PDFBox can open it without 
problems, but not if I wget it. I don't know how you tried, but I think the 
Chrome viewer fixes it while saving...
{quote}
No, Chrome didn't fix anything. It's the other way around: {{wget}} simply 
didn't retrieve the files:
h4. Geisler_COVID_statement_0A7A094E1EFB7.pdf
{noformat}
--2020-10-29 17:23:35--  http://www.geislerfarms.com/documents/filelibrary/Geisler_COVID_statement_0A7A094E1EFB7.pdf
Resolving www.geislerfarms.com (www.geislerfarms.com)... 205.237.127.32
Connecting to www.geislerfarms.com (www.geislerfarms.com)|205.237.127.32|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-10-29 17:23:35 ERROR 403: Forbidden.
{noformat}
Apparently {{wget}} is not allowed to retrieve the file; the request is 
immediately rejected with HTTP status 403 (Forbidden).
h4. SALHN+Governing+Board+Minutes+-+5+March+2020.pdf
{noformat}
--2020-10-29 17:28:04--  http://www.sahealth.sa.gov.au/wps/wcm/connect/c736e1d5-932e-4f8a-8e56-52ab10a214fd/SALHN+Governing+Board+Minutes+-+5+March+2020.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c736e1d5-932e-4f8a-8e56-52ab10a214fd-niR9I3J
Resolving www.sahealth.sa.gov.au (www.sahealth.sa.gov.au)... 184.24.25.168
Connecting to www.sahealth.sa.gov.au (www.sahealth.sa.gov.au)|184.24.25.168|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.sahealth.sa.gov.au/wps/wcm/connect/c736e1d5-932e-4f8a-8e56-52ab10a214fd/SALHN+Governing+Board+Minutes+-+5+March+2020.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c736e1d5-932e-4f8a-8e56-52ab10a214fd-niR9I3J [following]
--2020-10-29 17:28:04--  https://www.sahealth.sa.gov.au/wps/wcm/connect/c736e1d5-932e-4f8a-8e56-52ab10a214fd/SALHN+Governing+Board+Minutes+-+5+March+2020.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c736e1d5-932e-4f8a-8e56-52ab10a214fd-niR9I3J
Connecting to www.sahealth.sa.gov.au (www.sahealth.sa.gov.au)|184.24.25.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 188 [text/html]
Saving to: ‘SALHN+Governing+Board+Minutes+-+5+March+2020.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c736e1d5-932e-4f8a-8e56-52ab10a214fd-niR9I3J’
2020-10-29 17:28:05 (2.79 MB/s) - ‘SALHN+Governing+Board+Minutes+-+5+March+2020.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c736e1d5-932e-4f8a-8e56-52ab10a214fd-niR9I3J’ saved [188/188]
{noformat}
At first glance this looks better, but look at the reported length and content 
type: 188 bytes of {{text/html}}! That is not the expected size of the 
document, and a PDF should not be served as {{text/html}}.

Indeed, looking into the file one sees:
{noformat}
<html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br><br>Your support ID is: 9465683539026012375</body></html>
{noformat}
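In other words, the server rejected the {{wget}} request here too; it merely 
wrapped the rejection in a 200 response carrying an HTML error page.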
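
Both rejections are already visible in the response metadata. As a rough illustration, the following Java sketch judges a download by its HTTP status code and {{Content-Type}} header alone. The class and method names are hypothetical, and insisting on {{application/pdf}} is an assumption: some servers deliver genuine PDFs as {{application/octet-stream}}, so a mismatch is a warning sign rather than proof.

```java
import java.util.Locale;

final class DownloadResponseCheck {

    /**
     * Returns true only for a 200 response whose media type is application/pdf.
     * Hypothetical helper for illustration; not part of PDFBox or wget.
     */
    static boolean isPlausiblePdfResponse(int statusCode, String contentType) {
        if (statusCode != 200 || contentType == null) {
            return false; // e.g. the 403 Forbidden case, or a missing header
        }
        // Drop parameters such as "; charset=UTF-8" and compare case-insensitively.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/pdf");
    }

    public static void main(String[] args) {
        // The two responses discussed above: 403 Forbidden, and 200 OK with text/html.
        System.out.println(isPlausiblePdfResponse(403, "text/html"));
        System.out.println(isPlausiblePdfResponse(200, "text/html"));
    }
}
```

Neither of the two responses shown above would pass this check.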

----

Thus, please update your workflow to check whether your {{wget}} calls were 
executed successfully and really retrieved a PDF.
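
One possible form of such a check, run on each downloaded file before it is handed to {{PDDocument.load}}, is to look for the {{%PDF-}} header marker near the start of the file. The sketch below is only an illustration: the class name, the method name and the 1024-byte scan window are assumptions, not part of PDFBox, and a file that passes can still be damaged further in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfDownloadCheck {

    /** Assumed scan window: how many leading bytes are searched for the marker. */
    private static final int SCAN_LIMIT = 1024;

    /**
     * Returns true if "%PDF-" occurs within the first SCAN_LIMIT bytes of the file.
     * An HTML error page saved under a .pdf name fails this test.
     */
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head;
        try (InputStream in = Files.newInputStream(file)) {
            head = in.readNBytes(SCAN_LIMIT); // reads fewer bytes if the file is shorter
        }
        // ISO-8859-1 maps every byte to exactly one char, so binary data is safe to search.
        return new String(head, StandardCharsets.ISO_8859_1).contains("%PDF-");
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path file = Path.of(arg);
            String verdict = looksLikePdf(file) ? "looks like a PDF" : "does NOT look like a PDF";
            System.out.println(file + ": " + verdict);
        }
    }
}
```

Run against the 188-byte file saved by {{wget}} above, this reports that the file does not look like a PDF, because its content is the "Request Rejected" HTML page.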

> java.io.IOException: Error: End-of-File, expected line during PDDocument.load
> -----------------------------------------------------------------------------
>
>                 Key: PDFBOX-5006
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5006
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.20, 2.0.21
>         Environment: Debian, macOS, OpenJDK 12
>            Reporter: Nicolas M
>            Priority: Major
>
> I get an IOException when I try to open some PDFs with the library (calling 
> PDDocument.load(pdfFile)). Here are some URLs of affected PDFs (I think it's 
> the same problem for all of them):
>  * [https://www.buerger.uni-frankfurt.de/80977779/Rehbein_Schule_Hanau_9_2018.pdf]
>  * [http://www.geislerfarms.com/documents/filelibrary/Geisler_COVID_statement_0A7A094E1EFB7.pdf]
>  * [http://www.sahealth.sa.gov.au/wps/wcm/connect/c736e1d5-932e-4f8a-8e56-52ab10a214fd/SALHN+Governing+Board+Minutes+-+5+March+2020.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c736e1d5-932e-4f8a-8e56-52ab10a214fd-niR9I3J]
> I think the files are not well formed and don't respect the PDF 
> specification, but I can open them with other PDF viewers (the Chrome PDF 
> viewer, for example).
>  
> Here is the stack trace:
> {code:java}
> java.io.IOException: Error: End-of-File, expected line
>     at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1098)
>     at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2581)
>     at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:989)
> {code}


