[ https://issues.apache.org/jira/browse/PDFBOX-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-4800.
----------------------------------------
Fix Version/s: 3.0.0 PDFBox, 2.0.20
Resolution: Fixed
> Parsing of numbers does not always terminate at actual end of number
> --------------------------------------------------------------------
>
> Key: PDFBOX-4800
> URL: https://issues.apache.org/jira/browse/PDFBOX-4800
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.12, 2.0.15, 2.0.19
> Reporter: Eckhart Pedersen
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 2.0.20, 3.0.0 PDFBox
>
> Attachments: 1584634522723.txt, demobank_case_error_doc1.pdf,
> demobank_case_ok_doc1.pdf
>
>
> *Short description:*
> The method *readStringNumber* in *BaseParser.java* reads more characters
> than desired when parsing numbers in certain documents. We have internally
> fixed the issue by adding the following line (marked in red in Jira and
> flagged with an "added line" comment here):
> while ((lastByte = seqSource.read()) != ASCII_SPACE &&
>         lastByte != ASCII_LF &&
>         lastByte != ASCII_CR &&
>         lastByte != 60 &&    // see sourceforge bug 1714707
>         lastByte != '[' &&   // PDFBOX-1845
>         lastByte != '(' &&   // PDFBOX-2579
>         lastByte != 0 &&     // see sourceforge bug 853328
>         lastByte != '/' &&   // added line
>         lastByte != -1)
> {
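To illustrate what the extra check changes, here is a minimal, self-contained sketch of a loop with the same shape. It is not PDFBox code: the class `NumberTokenSketch`, the methods `readNumberToken` and `demo`, the `stopAtSolidus` switch, and the use of `PushbackInputStream` in place of PDFBox's `seqSource` are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class NumberTokenSketch {

    // Hypothetical stand-in for the loop quoted in the report.
    // stopAtSolidus toggles the proposed check for '/'.
    static String readNumberToken(PushbackInputStream in, boolean stopAtSolidus)
            throws IOException {
        StringBuilder token = new StringBuilder();
        int b;
        while ((b = in.read()) != ' ' &&
                b != '\n' &&
                b != '\r' &&
                b != '<' &&                      // 60 in the quoted code
                b != '[' &&
                b != '(' &&
                b != 0 &&
                !(stopAtSolidus && b == '/') &&  // the proposed extra check
                b != -1) {
            token.append((char) b);
        }
        if (b != -1) {
            in.unread(b);                        // leave the terminator for the next token
        }
        return token.toString();
    }

    // Convenience wrapper: parse the first token of an ASCII string.
    static String demo(String data, boolean stopAtSolidus) {
        byte[] bytes = data.getBytes(StandardCharsets.US_ASCII);
        try (PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(bytes))) {
            return readNumberToken(in, stopAtSolidus);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A number directly followed by a name object, with no white space between them.
        System.out.println(demo("12/Name 3", false)); // prints 12/Name
        System.out.println(demo("12/Name 3", true));  // prints 12
    }
}
```

Without the check, the token swallows the following name, and a later numeric conversion of `12/Name` would presumably fail; with it, reading stops at the solidus.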
> *Background:*
> Our customer ran into an issue with certain documents that were converted to
> PDF/A-2 format with Qoppa jPDFPreflight
> ([https://www.qoppa.com/pdfpreflight/]). In some instances PDFBox would
> afterwards fail to open the document.
> (It is possible that the Qoppa conversion tool does something wrong and that
> the resulting PDF is invalid somehow, but all other tools seem to open the
> converted documents without any problems. We are not PDF experts, so this is
> difficult for us to judge. If you determine that the problematic PDF document
> is incorrect somehow, please notify us so that we can file a bug report with
> Qoppa as well.)
> I am attaching both an original version of the document (which PDFBox can
> open just fine) and the converted version (which PDFBox cannot parse
> correctly).
> *Additional information:*
> My colleague refers to ISO 32000-1, section 7.2.2, which describes all valid
> white-space and delimiter characters for PDF.
> According to that list of delimiter and white-space characters, the following
> characters should also be handled in the readStringNumber method: '%', '{',
> ')', ']', '}', '>', FORM FEED, and HORIZONTAL TAB.
> Though again, as we are not experts on the PDF standard, we recommend that you
> check the mentioned standard documents yourself and determine what kind of
> solution you want to implement (if any).
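The complete character set mentioned above can be sketched as a small predicate. This is only an illustration of the white-space and delimiter tables in ISO 32000-1, section 7.2.2, not the change that was committed to PDFBox; the class `PdfTokenChars` and its method names are hypothetical.

```java
final class PdfTokenChars {

    // White-space characters: NUL, HORIZONTAL TAB, LINE FEED, FORM FEED,
    // CARRIAGE RETURN, SPACE.
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    // Delimiter characters: ( ) < > [ ] { } / %
    static boolean isDelimiter(int c) {
        switch (c) {
            case '(': case ')':
            case '<': case '>':
            case '[': case ']':
            case '{': case '}':
            case '/': case '%':
                return true;
            default:
                return false;
        }
    }

    // A number token ends at end of input (-1), white space, or a delimiter.
    static boolean endsNumber(int c) {
        return c == -1 || isWhitespace(c) || isDelimiter(c);
    }

    public static void main(String[] args) {
        System.out.println(endsNumber('%')); // prints true
        System.out.println(endsNumber('5')); // prints false
    }
}
```

A single predicate like this would cover the characters already checked in the loop as well as the eight additional ones listed in the report, at the cost of changing behavior for every terminator at once rather than one at a time.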
> *Final Note:*
> We are filing this bug report in the hope that you find it helpful. I have
> tried to include all relevant information as well as I can. If you have
> further questions, I would be happy to answer them.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)