[
https://issues.apache.org/jira/browse/PDFBOX-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445284#comment-16445284
]
Tilman Hausherr commented on PDFBOX-4201:
-----------------------------------------
I don't have an immediate idea for fixing this within the parser, and I'm not
sure that it should be fixed at all, because "Infinity" is not a number token.
(Adobe Reader also reports an error when trying text extraction.) If you
produced these PDFs, or if you know who did, then the producer should be fixed.
This is either a bug in the PDF library used (here: iTextSharp 4.1.7) or in the
calling application, which passed "Infinity" instead of a valid number. You
could also fix the files manually:
* run WriteDecodedDoc on your file (see
[https://pdfbox.apache.org/2.0/commandline.html])
* open the decoded file with an editor such as Notepad++
* overwrite "Infinity" with a number of the *same size* (8 characters), e.g.
00099999, so that the byte offsets in the file stay valid
* save
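
The manual steps above could also be scripted. The following is a minimal
sketch and not part of PDFBox: the class name InfinityPatcher and its method
names are hypothetical. It assumes the file has already been decoded with
WriteDecodedDoc and that every occurrence of "Infinity" in it is a bogus
number operand (it does not check whether the match sits inside a text string
or metadata). Each match is overwritten in place with 00099999, which has the
same length, so the file size and byte offsets do not change.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a PDFBox class.
final class InfinityPatcher {
    // Both tokens are 8 bytes long, so an in-place overwrite keeps offsets intact.
    private static final byte[] BAD = "Infinity".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] GOOD = "00099999".getBytes(StandardCharsets.ISO_8859_1);

    private InfinityPatcher() {
    }

    /** Overwrites every "Infinity" in data in place and returns the number of replacements. */
    public static int patch(byte[] data) {
        int count = 0;
        int i = 0;
        while (i <= data.length - BAD.length) {
            if (matchesAt(data, i)) {
                System.arraycopy(GOOD, 0, data, i, GOOD.length);
                count++;
                i += BAD.length;
            } else {
                i++;
            }
        }
        return count;
    }

    /** Reads a decoded PDF, patches it in memory, and writes the result to a new file. */
    public static int patchFile(Path decodedPdf, Path patchedPdf) throws IOException {
        byte[] data = Files.readAllBytes(decodedPdf);
        int count = patch(data);
        Files.write(patchedPdf, data);
        return count;
    }

    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < BAD.length; j++) {
            if (data[offset + j] != BAD[j]) {
                return false;
            }
        }
        return true;
    }
}
```

A call such as InfinityPatcher.patchFile(Paths.get("decoded.pdf"),
Paths.get("patched.pdf")) (both file names are placeholders) returns the number
of replacements made. A return value of 0 would suggest that the file was not
decoded first or that it fails for a different reason.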
> Certain scanned pdfs do not render
> ----------------------------------
>
> Key: PDFBOX-4201
> URL: https://issues.apache.org/jira/browse/PDFBOX-4201
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.8
> Reporter: Antonio Contreras
> Priority: Major
> Attachments: PDFBOX-4201-content-stream.txt, testDoc2.pdf
>
>
> I am using PDFBox version 2.0.8. I am trying to render scanned PDFs, but
> some of them do not render and result in an error. Native PDFs do not
> have any trouble rendering. The majority of the scanned PDFs that I have also
> render without trouble, but a couple result in an
> error (one is attached).
> This is the code I used to render the pdf.
> {code:java}
> try (PDDocument document = load(file)) {
>     logger.debug("start generate image file " + pageNumber + " for " + name);
>     PDFRenderer pdfRenderer = new PDFRenderer(document);
>     return getPageImage(pdfRenderer, pageNumber, name, storageId);
> }{code}
> The above call to getPageImage calls the following code
> {code:java}
> File imageFile = File.createTempFile(StringUtils.toFilename(storageId) + "_" + pageNumber, ".png");
> imageFile.deleteOnExit();
> final BufferedImage image = pdfRenderer.renderImageWithDPI(pageNumber - 1, dpi, ImageType.RGB);
> ImageIO.write(image, "png", imageFile);
> logger.debug("completed generate image file " + pageNumber + " for " + name);
> return imageFile;{code}
> The issue occurs in the second code snippet in the line
> {code:java}
> final BufferedImage image = pdfRenderer.renderImageWithDPI(pageNumber - 1, dpi, ImageType.RGB);{code}
>
> The stack trace is the following
> {code:java}
> Caused by: java.io.IOException: Error: Expected operator 'ID' actual='In'
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:305) ~[pdfbox-2.0.8.jar:2.0.8]
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:502) ~[pdfbox-2.0.8.jar:2.0.8]
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469) ~[pdfbox-2.0.8.jar:2.0.8]
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150) ~[pdfbox-2.0.8.jar:2.0.8]
>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:203) ~[pdfbox-2.0.8.jar:2.0.8]
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:145) ~[pdfbox-2.0.8.jar:2.0.8]
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:94) ~[pdfbox-2.0.8.jar:2.0.8]
>     at com.sustain.document.PdfPageGenerator.getPageImage(PdfPageGenerator.java:70) ~[classes/:?]
>     at com.sustain.document.PdfPageGenerator.getPageImage(PdfPageGenerator.java:59) ~[classes/:?]
> {code}
> Since rendering was not an issue with native PDFs, I initially thought that
> only scanned PDFs were affected. But since other scanned PDFs rendered, I am
> uncertain what causes some to render and others to fail.