[jira] [Updated] (PDFBOX-4800) Parsing of numbers does not always terminate at actual end of number

Eckhart Pedersen (Jira) Fri, 20 Mar 2020 00:20:25 -0700


     [ https://issues.apache.org/jira/browse/PDFBOX-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Eckhart Pedersen updated PDFBOX-4800:
-------------------------------------
    Description: 
*Short description:*
 The method *readStringNumber* in *BaseParser.java* reads past the end of a 
number when parsing certain documents. We have fixed the issue internally by 
adding the following line (marked with a trailing "added line" comment):

    while( (lastByte = seqSource.read() ) != ASCII_SPACE &&
            lastByte != ASCII_LF &&
            lastByte != ASCII_CR &&
            lastByte != 60 && //see sourceforge bug 1714707
            lastByte != '[' && // PDFBOX-1845
            lastByte != '(' && // PDFBOX-2579
            lastByte != 0 && //See sourceforge bug 853328
            lastByte != '/' && // added line
            lastByte != -1 )
    {
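
The effect of the added check can be illustrated in isolation. The following is a minimal sketch, not the PDFBox code: the class name NumberTokenDemo, the method readNumber and the sample input "612/Type" are hypothetical, and the input is only an assumption about what a number followed directly by a name might look like. It uses a plain InputStream in place of seqSource.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class NumberTokenDemo
{
    // Simplified stand-in for the loop in readStringNumber (hypothetical helper).
    // When stopAtSolidus is true, '/' ends the token, as in the proposed fix.
    static String readNumber(String input, boolean stopAtSolidus) throws IOException
    {
        InputStream in = new ByteArrayInputStream(input.getBytes(StandardCharsets.ISO_8859_1));
        StringBuilder buffer = new StringBuilder();
        int lastByte;
        while ((lastByte = in.read()) != ' ' &&
                lastByte != '\n' &&
                lastByte != '\r' &&
                lastByte != '<' &&
                lastByte != '[' &&
                lastByte != '(' &&
                lastByte != 0 &&
                !(stopAtSolidus && lastByte == '/') &&
                lastByte != -1)
        {
            buffer.append((char) lastByte);
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException
    {
        // Without the check, the name that follows is swallowed into the number token.
        System.out.println(readNumber("612/Type", false)); // prints "612/Type"
        // With the check, reading stops at the solidus.
        System.out.println(readNumber("612/Type", true));  // prints "612"
    }
}
```

A token such as "612/Type" cannot be converted to a number, which would be consistent with the parse failure described here, although the actual byte sequence in the attached document may differ.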

*Background:*
 Our customer ran into an issue with certain documents that were converted to 
PDF/A-2 format with Qoppa jPDFPreflight ([https://www.qoppa.com/pdfpreflight/]). 
In some instances PDFBox subsequently failed to open the converted document.

(It is possible that the Qoppa conversion tool does something wrong and that 
the resulting PDF is invalid, but all other tools seem to open the converted 
documents without any problems. We are not PDF experts, so this is difficult 
for us to judge. If you determine that the problematic PDF document is 
invalid, please notify us so that we can also file a bug report with Qoppa.)

I am attaching both an original version of the document (which pdfbox can open 
just fine) and the converted version (which pdfbox cannot parse correctly).

*Additional information*

My colleague refers to ISO 32000-1, section 7.2.2, which lists all valid 
white-space and delimiter characters for PDF.

According to that list, the following characters should also be handled in 
the readStringNumber method: '%', '{', ')', ']', '}', '>', FORM FEED, and 
HORIZONTAL TAB.
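
One way to cover the whole list at once is a pair of character-class predicates. This is a minimal sketch under the assumption that the tables in ISO 32000-1 section 7.2.2 are read correctly here; the class name PdfCharClasses and its methods are hypothetical and are not part of PDFBox.

```java
class PdfCharClasses
{
    // White-space characters: NUL, HORIZONTAL TAB, LINE FEED, FORM FEED,
    // CARRIAGE RETURN, SPACE.
    static boolean isWhitespace(int c)
    {
        return c == 0x00 || c == 0x09 || c == 0x0A ||
               c == 0x0C || c == 0x0D || c == 0x20;
    }

    // Delimiter characters: ( ) < > [ ] { } / %
    static boolean isDelimiter(int c)
    {
        switch (c)
        {
            case '(': case ')':
            case '<': case '>':
            case '[': case ']':
            case '{': case '}':
            case '/': case '%':
                return true;
            default:
                return false;
        }
    }

    // A numeric token ends at end of input (-1), white-space, or a delimiter.
    static boolean endsNumber(int c)
    {
        return c == -1 || isWhitespace(c) || isDelimiter(c);
    }
}
```

With such a helper, the loop condition could be reduced to a single test along the lines of `!PdfCharClasses.endsNumber(lastByte)`. Whether that is the right change for PDFBox is for the maintainers to decide.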

Again, as we are not experts on the PDF standard, we recommend that you check 
the standard yourself and decide what kind of solution you want to implement 
(if any).

*Final Note:*

We are filing this bug report in the hope that you find it helpful. I have 
tried to include all relevant information. If you have further questions, I 
would be happy to answer them as well as I can.

  was:
*Short description:*
The method *readStringNumber* in *BaseParser.java* fails to terminate parsing 
of numbers for certain documents. We have internally fixed the issue by adding 
the following line (marked with a trailing "added line" comment):

    while( (lastByte = seqSource.read() ) != ASCII_SPACE &&
            lastByte != ASCII_LF &&
            lastByte != ASCII_CR &&
            lastByte != 60 && //see sourceforge bug 1714707
            lastByte != '[' && // PDFBOX-1845
            lastByte != '(' && // PDFBOX-2579
            lastByte != 0 && //See sourceforge bug 853328
            lastByte != '/' && // added line
            lastByte != -1 )
    {
 
*Background:*
Our customer ran into an issue with certain documents that were converted to 
PDF/A2 format with Qoppa jPDFPreflight ([https://www.qoppa.com/pdfpreflight/]). 
In some instances pdfbox would afterwards fail to open the document.

(It is possible that the Qoppa conversion tool does something wrong and that 
the resulting PDF is invalid somehow, but all other tools seem to open the 
converted documents without any problems. We are not PDF experts, so this is 
difficult for us to judge. If you determine that the problematic PDF document 
is incorrect somehow, please notify us so that we can create a bug report at 
Qoppa also.)

I am attaching both an original version of the document (which pdfbox can open 
just fine) and the converted version (which pdfbox cannot parse correctly).

*Additional information*

My colleague refers to ISO 32000-1 section 7.2.2 which describes all valid 
white-space and delimiter characters for PDF.

According to the list of delimiter/white-space characters the following 
characters should also be handled in the readStringNumber method: '%', '{', 
')', ']', '}', '>', FORM FEED, and HORIZONTAL TAB.

Though again, as we are not experts on the PDF standard we recommend that you 
check the mentioned standard documents yourself and determine what kind of 
solution you want to implement (if any).

*Final Note:*

We are filing this bug report in the hope that you find it helpful. I have 
tried to include all relevant information as well as I can, if you have further 
questions, I would be happy to address them as well as I can.


> Parsing of numbers does not always terminate at actual end of number
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-4800
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4800
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.12, 2.0.15, 2.0.19
>            Reporter: Eckhart Pedersen
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>         Attachments: 1584634522723.txt, demobank_case_error_doc1.pdf, 
> demobank_case_ok_doc1.pdf


