Peter Wyatt created PDFBOX-5156:
-----------------------------------

             Summary: Error in identification of PDF comment symbol % as a 
token separator with PDF names
                 Key: PDFBOX-5156
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5156
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 3.0.0 PDFBox
            Reporter: Peter Wyatt


The DARPA-funded SafeDocs research program has developed a Compacted PDF Syntax 
test case to stress-test PDF lexical analyzers/parsers. See 
[https://github.com/pdf-association/safedocs/tree/main/CompactedSyntax]. The 
output for this test PDF was examined in detail using the PDFBox debugger "view 
internal structure" feature, for both the body and the content stream, and this 
is the only error found... so well done! 

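PDFBox 3.0.0-RC1 was tested using this highly targeted test PDF and there is an 
error in the lexical analysis (token separators) between PDF name objects and 
PDF comments: a PERCENT SIGN that immediately follows a name does not appear to 
be treated as a delimiter that ends the name and starts a comment. As specified 
in ISO 32000-2: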
 * clause 7.2.3: "The delimiter characters (, ), <, >, [, ], /, and % are 
special (LEFT PARENTHESIS (28h), RIGHT PARENTHESIS (29h), LESS-THAN SIGN (3Ch), 
GREATER-THAN SIGN (3Eh), LEFT SQUARE BRACKET (5Bh), RIGHT SQUARE BRACKET (5Dh), 
SOLIDUS (2Fh) and PERCENT SIGN (25h), respectively). They delimit syntactic 
entities such as arrays, names, and comments. ... Any of these delimiters 
terminates the entity preceding it and is not included in the entity."
 * clause 7.2.4: "Any occurrence of the PERCENT SIGN (25h) outside a string or 
inside a content stream (see 7.8.2, "Content streams") introduces a comment."

Offset 3561 (as reported in the output below) is in the middle of this fragment 
of PDF: {{<</Root 1 0 R/Info%comment after name}}
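
For illustration only, the expected tokenization of that fragment can be 
sketched in a few lines of Java. This is not PDFBox code: the class and method 
names (NameTokenSketch, isDelimiter, isWhitespace, tokenize) are hypothetical, 
the text after the comment in the sample input is invented for the example, and 
literal strings, hex strings and streams are deliberately ignored. It uses only 
the delimiter list quoted from clause 7.2.3 above and assumes a comment runs to 
the end of the line.

```java
import java.util.ArrayList;
import java.util.List;

public class NameTokenSketch {

    // Delimiter characters as listed in the clause 7.2.3 quotation above.
    static boolean isDelimiter(int c) {
        return c == '(' || c == ')' || c == '<' || c == '>'
            || c == '[' || c == ']' || c == '/' || c == '%';
    }

    // PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    // Splits a fragment into tokens; comments are skipped, not returned.
    // Simplification: strings and hex strings are not handled.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int n = s.length();
        int i = 0;
        while (i < n) {
            char c = s.charAt(i);
            if (isWhitespace(c)) {
                i++;
            } else if (c == '%') {
                // Comment: skip up to (not including) the end-of-line marker.
                while (i < n && s.charAt(i) != '\n' && s.charAt(i) != '\r') {
                    i++;
                }
            } else if (c == '/') {
                // Name: ends at white-space or at ANY delimiter, including '%'.
                int start = i++;
                while (i < n && !isWhitespace(s.charAt(i)) && !isDelimiter(s.charAt(i))) {
                    i++;
                }
                tokens.add(s.substring(start, i));
            } else if ((c == '<' || c == '>') && i + 1 < n && s.charAt(i + 1) == c) {
                // Dictionary open/close markers.
                tokens.add(s.substring(i, i + 2));
                i += 2;
            } else if (isDelimiter(c)) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                // Regular token (number, keyword): same termination rule as names.
                int start = i;
                while (i < n && !isWhitespace(s.charAt(i)) && !isDelimiter(s.charAt(i))) {
                    i++;
                }
                tokens.add(s.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The second line ("2 0 R>>") is a made-up continuation, not taken from the test PDF.
        System.out.println(tokenize("<</Root 1 0 R/Info%comment after name\n2 0 R>>"));
    }
}
```

Under these assumptions the name token is {{/Info}} and the words "comment", 
"after" and "name" never reach the object parser, whereas the warnings below 
show 'after' being parsed as a direct object.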

Note also that other/earlier versions of PDFBOX were not tested.

{{java -jar pdfbox-app-3.0.0-RC1.jar debug 
safedocs\CompactedSyntax\CompactedPDFSyntaxTest.pdf}}

{{Apr. 08, 2021 9:41:24 AM org.apache.pdfbox.pdfparser.BaseParser 
parseDirObject}}
{{WARNING: Skipped unexpected dir object = 'after' at offset 3561}}
{{Apr. 08, 2021 9:41:24 AM org.apache.pdfbox.pdfparser.BaseParser 
parseCOSDictionaryNameValuePair}}
{{WARNING: Bad dictionary declaration at offset 3562}}
{{Apr. 08, 2021 9:41:24 AM org.apache.pdfbox.pdfparser.BaseParser 
parseCOSDictionary}}
{{WARNING: Invalid dictionary, found: 'n' but expected: '/' at offset 3562}}



