[
https://issues.apache.org/jira/browse/PDFBOX-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802774#comment-16802774
]
Michael Klink commented on PDFBOX-4495:
---------------------------------------
{quote}[~Giorgy]>Are you talking about an issue in PDFBox or corrupt document?
Because the document is open with Chrome or DocumentViewer (Ubuntu) and I can
see the content without problems.
{quote}
First of all, that a document opens with a number of PDF viewers does not
automatically imply that it is not corrupted. PDF viewers (first and foremost
Adobe Reader but then mimicked by other viewers to be compatible with Adobe)
have a long history of repairing numerous errors under the hood.
That being said, an indirect object reference like {{18446744071795507394 0 R}}
apparently _is_ allowed by the specification, at least by its letter, even
though it is extremely implausible that a PDF will ever contain enough objects
to make an object number as large as {{18446744071795507394}} necessary. Limits
for integers are only implementation limits, not hard limits.
Thus, in this case it most likely (as you haven't shared the PDF, nothing is
sure) is an issue of PDFBox; but it is an issue that _the software that
generated your PDF explicitly wanted to trigger_.
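To put the size of that object number in perspective, here is a small standalone Java sketch. The class and method names are made up for illustration and are not PDFBox API. It checks whether the token fits into a signed 64-bit {{long}} and shows which signed value shares its 64-bit pattern. That this signed value would even fit into 32 bits might hint at a sign-extension bug in the generator, but without the PDF that is only a guess.

```java
import java.math.BigInteger;

// Illustrative only: hypothetical helper, not part of PDFBox.
final class ObjectNumberRange {
    // The object number from the exception message in this issue.
    static final String TOKEN = "18446744071795507394";

    // True if the decimal token fits into a signed 64-bit long.
    static boolean fitsInLong(String token) {
        // bitLength() excludes the sign bit, so at most 63 bits fit into a long.
        return new BigInteger(token).bitLength() <= 63;
    }

    // Reads the token as an unsigned 64-bit value and returns the
    // signed long that has the same bit pattern.
    static long asSignedBits(String token) {
        return Long.parseUnsignedLong(token);
    }

    public static void main(String[] args) {
        System.out.println(fitsInLong(TOKEN));   // false
        System.out.println(asSignedBits(TOKEN)); // -1914044222
    }
}
```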
{quote}[~tilman]>return COSNull instead of throwing an exception, as suggested
by Michael Klink
{quote}
Well, by my "strictly speaking" I wanted to imply that I'm not sure whether I
really should suggest implementing it that way. Using such object numbers
implies that the generating software explicitly wants to tease PDF processors,
and I'm not sure one has to play along... (Well, or the generating software is
broken and, therefore, generates wrong object numbers. In this case one
definitely doesn't have to play along.)
But if you really want to follow that suggestion, you should check *all* code
that recognizes references and update it accordingly, not only the code in
{{parseCOSDictionaryValue}}. Supporting such long object numbers in some
contexts and not in others might cause even more inexplicably weird behaviors.
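One way to keep the behavior consistent could be to route every place that recognizes a reference through a single check. The following is only a sketch under that assumption: {{ReferenceNumbers}} and {{parseObjectNumber}} are hypothetical names, not existing PDFBox code, and how a caller maps an empty result (for example to a null object) is left open.

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of PDFBox: one shared check for all
// code paths that recognize an indirect reference "N G R".
final class ReferenceNumbers {
    // Returns the object number, or empty if the token is not a plain
    // decimal number or does not fit into a signed 64-bit long.
    static OptionalLong parseObjectNumber(String token) {
        if (token.isEmpty()
                || !token.chars().allMatch(c -> c >= '0' && c <= '9')) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(token));
        } catch (NumberFormatException e) {
            // Digits only, but too large for a long.
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseObjectNumber("18").isPresent());
        System.out.println(parseObjectNumber("18446744071795507394").isPresent());
    }
}
```

With a single entry point like this, an oversized object number is handled the same way wherever a reference is parsed, instead of being accepted in one context and rejected in another.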
> Expected number, actual=COSFloat
> --------------------------------
>
> Key: PDFBOX-4495
> URL: https://issues.apache.org/jira/browse/PDFBOX-4495
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Reporter: Jorge Spinsanti
> Priority: Major
>
> I got an exception while extracting text from a PDF using Tika. I can't
> attach a file to reproduce it due to confidentiality.
> {code:java}
> java.io.IOException: expected number, actual=COSFloat{18446744071795507394} at offset 9277
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:166)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:279)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:212)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:864)
>     at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:904)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:873)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:793)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:753)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1200)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1173)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> {code}
> Related: https://issues.apache.org/jira/browse/TIKA-2842
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)