[ 
https://issues.apache.org/jira/browse/PDFBOX-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802774#comment-16802774
 ] 

Michael Klink commented on PDFBOX-4495:
---------------------------------------

{quote}[~Giorgy]>Are you talking about an issue in PDFBox or corrupt document? 
Because the document is open with Chrome or DocumentViewer (Ubuntu) and I can 
see the content without problems.
{quote}
First of all, the fact that a document opens in a number of PDF viewers does 
not automatically imply that it is not corrupted. PDF viewers (first and 
foremost Adobe Reader, mimicked by other viewers for compatibility with Adobe) 
have a long history of silently repairing numerous errors under the hood.

That being said, an indirect object reference like {{18446744071795507394 0 R}} 
apparently _is_ allowed by the specification, at least by its letter, even 
though it is extremely implausible that a PDF will ever contain enough objects 
to make an object number as large as {{18446744071795507394}} necessary: the 
limits for integers are only implementation limits, not hard limits.
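
To illustrate the implementation-limit point, here is a minimal Java sketch 
that checks whether an object number token fits into a signed 64-bit {{long}}. 
The class and method names are made up for illustration and are not part of 
PDFBox; that a parser would hold object numbers in a {{long}} is likewise only 
my assumption.

```java
import java.math.BigInteger;

// Hypothetical helper, not PDFBox code.
public class ObjectNumberRange {
    private static final BigInteger LONG_MIN = BigInteger.valueOf(Long.MIN_VALUE);
    private static final BigInteger LONG_MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /**
     * Returns true if the decimal token can be stored in a signed 64-bit long.
     * Throws NumberFormatException if the token is not a decimal integer.
     */
    public static boolean fitsInLong(String token) {
        BigInteger value = new BigInteger(token);
        return value.compareTo(LONG_MIN) >= 0 && value.compareTo(LONG_MAX) <= 0;
    }
}
```

For {{18446744071795507394}} this returns false, because the value exceeds 
{{Long.MAX_VALUE}} (9223372036854775807), while an ordinary object number 
like {{18}} passes.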

Thus, in this case it is most likely an issue in PDFBox (as you haven't shared 
the PDF, nothing is certain); but it is an issue that _the software that 
generated your PDF explicitly wanted to trigger_.
{quote}[~tilman]>return COSNull instead of throwing an exception, as suggested 
by Michael Klink
{quote}
Well, by my "strictly speaking" I wanted to imply that I'm not sure whether I 
really should suggest implementing it that way. Using such object numbers 
implies that the generating software explicitly wants to tease PDF processors, 
and I'm not sure one has to play along... (Well, or the generating software is 
broken and, therefore, generates wrong object numbers. In this case one 
definitely doesn't have to play along.)
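
A side note on the "broken software" possibility: purely as an observation on 
the digits, {{18446744071795507394}} equals 2^64 - 1914044222, which is what a 
negative 64-bit value looks like when printed as unsigned. As the PDF has not 
been shared, this is not a diagnosis, only one possible reading. The sketch 
below (class and method names are hypothetical) reproduces the arithmetic 
with standard JDK calls.

```java
// Hypothetical helper, not PDFBox code.
public class UnsignedReading {
    /**
     * Parses a decimal string as an unsigned 64-bit value and returns
     * the same bit pattern interpreted as a signed long.
     */
    public static long asSignedLong(String unsignedDecimal) {
        return Long.parseUnsignedLong(unsignedDecimal);
    }

    /** Prints a signed long the way an unsigned 64-bit formatter would. */
    public static String asUnsignedString(long value) {
        return Long.toUnsignedString(value);
    }
}
```

{{asSignedLong("18446744071795507394")}} yields {{-1914044222}}, and 
{{asUnsignedString(-1914044222L)}} yields the number from the exception again.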

But if you really want to follow that suggestion, you should check *all* code 
that recognizes references and update it accordingly, not only 
{{parseCOSDictionaryValue}}. Supporting such long object numbers in some 
contexts and not in others might cause even more inexplicably weird behavior.
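
One way to keep the behavior consistent would be to route every place that 
recognizes a reference through a single check. The following is only a sketch 
under my own assumptions: the class and method names are hypothetical, it does 
not use or reflect the actual PDFBox parser code, and treating an empty result 
as "resolve to the null object" is just the policy discussed above.

```java
import java.math.BigInteger;
import java.util.Optional;

// Hypothetical shared helper, not PDFBox code.
public class ReferenceNumbers {
    private static final BigInteger LONG_MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /**
     * Returns the object number if the token is a positive decimal integer
     * that fits into a long. An empty result means the caller should treat
     * the reference as pointing to a non-existent object (assumed policy).
     */
    public static Optional<Long> parseObjectNumber(String token) {
        // Only plain digit sequences qualify; reals like "1.5" do not.
        if (token == null || !token.matches("\\d+")) {
            return Optional.empty();
        }
        BigInteger value = new BigInteger(token);
        // Reject zero and anything beyond the assumed 64-bit storage limit.
        if (value.signum() <= 0 || value.compareTo(LONG_MAX) > 0) {
            return Optional.empty();
        }
        return Optional.of(value.longValueExact());
    }
}
```

With a single helper like this, a dictionary value, an array element and any 
other reference would all handle {{18446744071795507394 0 R}} the same way.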

> Expected number, actual=COSFloat
> --------------------------------
>
>                 Key: PDFBOX-4495
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4495
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>            Reporter: Jorge Spinsanti
>            Priority: Major
>
> I got an exception while extracting text from a PDF using Tika. I can't 
> attach a file to reproduce it due to confidentiality.
> {code:java}
> java.io.IOException: expected number, actual=COSFloat{18446744071795507394} at offset 9277
>  at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:166)
>  at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:279)
>  at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:212)
>  at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:864)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:904)
>  at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:873)
>  at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:793)
>  at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:753)
>  at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
>  at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
>  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1200)
>  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1173)
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> {code}
>  Related: https://issues.apache.org/jira/browse/TIKA-2842


