[ 
https://issues.apache.org/jira/browse/PDFBOX-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5025.
-------------------------------------
    Fix Version/s: 2.0.33
                   3.0.4 PDFBox
                   4.0.0
       Resolution: Fixed

Thank you; it failed in 2.0 because that version always parses the complete 
PDF. It doesn't fail in 3.0 because that version parses on demand, and the item 
is never used unless accessed directly.

> BaseParser fails when a number is followed by a string starting with 'e'
> ------------------------------------------------------------------------
>
>                 Key: PDFBOX-5025
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5025
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.21, 2.0.32, 3.0.3 PDFBox
>            Reporter: Cody Wayne Holmes
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: issue2931.pdf, issue3323.pdf
>
>
> I have found an issue in the latest version of PDFBox where parsing fails in 
> the BaseParser when `parseDirObject` parses a number and the following string 
> starts with an 'e'.
>   
>  This happens because the number parser also accepts scientific notation, so 
> when a number is directly followed by the 'endobj' keyword, the leading 'e' 
> is read as part of the number. Such PDFs are invalid because they lack a new 
> line between the number and the 'endobj' keyword, but the failure can be 
> prevented.
>   
>  One way that seems to resolve this problem is to check whether the last 
> character of the number string that was read is an 'e' or 'E'. If it is, 
> removing it from the string and unreading it from the source allows parsing 
> to complete as expected.
>   
> {code:java}
> private COSNumber parseCOSNumber() throws IOException
> {
>   ...
>   // Push back a trailing 'e'/'E': it starts the next token, not an exponent
>   char lastc = buf.charAt(buf.length() - 1);
>   if (lastc == 'e' || lastc == 'E')
>   { 
>     buf.deleteCharAt(buf.length() - 1);
>     seqSource.unread(lastc);
>   }
>   return COSNumber.get(buf.toString());
> }
> {code}
>   
>  An example of this error can be seen in PDF.js issue3323.
> [https://github.com/mozilla/pdf.js/blob/4ba28de2608866dcb10d627d77dc19ff3d017c17/test/pdfs/issue3323.pdf]
> More PDFs showing the problem are attached to this issue as well.
>   
>  
> [https://github.com/mozilla/pdf.js/commit/26f5b1b2d37c7b74a073dee75d66fcc04fae10e8]
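
The idea in the quoted report can be tried outside PDFBox with a small, 
self-contained sketch. Everything below is an assumption made for 
illustration: the class TrailingExponentSketch, the helpers readNumber and 
split, and the use of java.io.PushbackReader in place of the parser's real 
source are hypothetical, and the set of accepted number characters is a 
simplification. It is not the committed fix, only a demonstration that 
unreading a dangling 'e' leaves the following keyword intact.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class TrailingExponentSketch
{
    // Simplified assumption: characters allowed inside a numeric token,
    // including the exponent marker used by scientific notation.
    static boolean isNumberChar(char c)
    {
        return Character.isDigit(c) || c == '+' || c == '-' || c == '.'
                || c == 'e' || c == 'E';
    }

    // Hypothetical helper: reads one numeric token and pushes back a
    // trailing 'e'/'E', which must belong to the next token.
    static String readNumber(PushbackReader in) throws IOException
    {
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && isNumberChar((char) c))
        {
            buf.append((char) c);
        }
        if (c != -1)
        {
            in.unread(c); // first character after the token
        }
        if (buf.length() > 0)
        {
            char last = buf.charAt(buf.length() - 1);
            if (last == 'e' || last == 'E')
            {
                buf.deleteCharAt(buf.length() - 1);
                in.unread(last); // now read again before the character above
            }
        }
        return buf.toString();
    }

    // Returns { number token, remaining input } for a given input string.
    static String[] split(String input)
    {
        // Pushback capacity 2: up to two characters are unread in a row.
        try (PushbackReader in = new PushbackReader(new StringReader(input), 2))
        {
            String number = readNumber(in);
            StringBuilder rest = new StringBuilder();
            int c;
            while ((c = in.read()) != -1)
            {
                rest.append((char) c);
            }
            return new String[] { number, rest.toString() };
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        String[] parts = split("12endobj");
        System.out.println(parts[0] + " | " + parts[1]); // prints "12 | endobj"
    }
}
```

A number with a complete exponent, such as "1e5", is returned unchanged by 
this sketch, because only an 'e' or 'E' in the last position is pushed back.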



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
