[ 
https://issues.apache.org/jira/browse/PDFBOX-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259196#comment-13259196
 ] 

Eric Leleu commented on PDFBOX-1279:
------------------------------------

Hi,

The PDF Reference says:

"... PDF can be entirely represented using byte values corresponding to the 
visible printable subset of the ASCII character set, plus white space 
characters such as space, tab, carriage return, and line feed characters. ASCII 
is the American Standard Code for Information Interchange, a widely used 
convention for encoding a specific set of 128 characters as binary numbers. 
However, a PDF file is not restricted to the ASCII character set; it can 
contain arbitrary 8-bit bytes,..."

So no particular charset is recommended. However, the default should be 
US-ASCII or ISO-8859-1 rather than UTF-8.

The problem comes from the comment line that immediately follows the header 
line and contains at least 4 binary characters (code >= 128). As far as I 
remember, to match binary characters in JavaCC we must describe them using the 
Unicode notation (\uxxxx). With the Cp1252 charset, the byte <9F> cannot match 
the token BINARY([\u0080-\u00FF]), because Cp1252 maps it to the Unicode 
character \u0178. (See 
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT)
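
To illustrate the mapping issue outside of JavaCC, here is a minimal Java 
sketch. It assumes the parser input is decoded through a charset before 
tokenizing, and that the JDK provides the "windows-1252" charset (the usual 
name for Cp1252). The class and method names are made up for this example; 
they are not part of PDFBox or Preflight.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not a PDFBox class).
class BinaryByteDecoding {

    // Decode one raw byte with the given charset and return the code point
    // that a character-based tokenizer would see.
    static int decodeByte(int rawByte, Charset charset) {
        String decoded = new String(new byte[] { (byte) rawByte }, charset);
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: "windows-1252" is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        // ISO-8859-1 maps every byte to the code point with the same value.
        System.out.printf("ISO-8859-1: U+%04X%n",
                decodeByte(0x9F, StandardCharsets.ISO_8859_1)); // U+009F

        // Cp1252 maps byte 9F to U+0178, outside the range 0080..00FF.
        System.out.printf("Cp1252:     U+%04X%n",
                decodeByte(0x9F, cp1252)); // U+0178
    }
}
```

The same byte therefore falls inside the BINARY range with ISO-8859-1 and 
outside it with Cp1252.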

So we have 3 possibilities:

[1] - Find a way to specify binary characters without Unicode notation in 
JavaCC

[2] - Add all the Unicode exceptions for Cp1252 to the BINARY token description

[3] - Widen the BINARY token to [\u0080-\uFFFF] so that it no longer depends on 
the specifics of other charsets.


I prefer the first one, but if we can't do it, the third one is probably the 
best way to avoid further issues.

With the third option, my whole test set runs successfully with the following 
encodings:
- US-ASCII
- Cp1252
- ISO-8859-1
- utf8 
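
As a rough check of why the wider range is less charset-sensitive, the sketch 
below decodes the byte <9F> with each of the four encodings and tests the 
result against both ranges. It uses java.util.regex as a stand-in for the 
JavaCC token definition, which is an assumption made for illustration; the 
class name and the choice of a single test byte are also hypothetical.

```java
import java.nio.charset.Charset;
import java.util.regex.Pattern;

// Hypothetical check, for illustration only (not the Preflight grammar).
class BinaryTokenRanges {

    // Current BINARY range and the widened range from option [3].
    static final Pattern NARROW = Pattern.compile("[\\u0080-\\u00FF]");
    static final Pattern WIDE = Pattern.compile("[\\u0080-\\uFFFF]");

    // Decode one raw byte with the named charset, then test the decoded
    // text against the given character range.
    static boolean matches(Pattern range, int rawByte, String charsetName) {
        byte[] raw = { (byte) rawByte };
        String decoded = new String(raw, Charset.forName(charsetName));
        return range.matcher(decoded).matches();
    }

    public static void main(String[] args) {
        // Assumption: all four charset names are supported by this JDK.
        String[] names = { "US-ASCII", "windows-1252", "ISO-8859-1", "UTF-8" };
        for (String name : names) {
            System.out.printf("%-12s narrow=%b wide=%b%n", name,
                    matches(NARROW, 0x9F, name),
                    matches(WIDE, 0x9F, name));
        }
    }
}
```

With US-ASCII and UTF-8, Java replaces the undecodable byte with U+FFFD, which 
only the widened range accepts; with Cp1252 it becomes U+0178, also accepted 
only by the widened range. Only ISO-8859-1 satisfies the narrow range for this 
byte.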


BR,
Eric
                
> Preflight reports "1.1 : Body Syntax error"
> -------------------------------------------
>
>                 Key: PDFBOX-1279
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1279
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Preflight
>    Affects Versions: 1.7.0
>         Environment: Win 7 64Bit, jre 1.6.31
>            Reporter: beat weisskopf
>            Priority: Minor
>             Fix For: 1.7.0
>
>         Attachments: input_pdf_a_lvl_a_libreoffice_352.pdf, 
> pdfbox_1279_cs.patch
>
>
> Just tried the PDF/A Validation. It fails on the attached pdf with "1.1 : 
> Body Syntax error". Adobe Preflight reports success for both pdf/a level a 
> and pdf/a level b validation. PDF was created with plain LibreOffice 3.5.2 
> (export as pdf, using pdf/a level a).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
