[ https://issues.apache.org/jira/browse/PDFBOX-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259196#comment-13259196 ]
Eric Leleu commented on PDFBOX-1279:
------------------------------------

Hi,

In the PDF Reference, we can read:

"... PDF can be entirely represented using byte values corresponding to the visible printable subset of the ASCII character set, plus white space characters such as space, tab, carriage return, and line feed characters. ASCII is the American Standard Code for Information Interchange, a widely used convention for encoding a specific set of 128 characters as binary numbers. However, a PDF file is not restricted to the ASCII character set; it can contain arbitrary 8-bit bytes,..."

So there is no recommended charset. However, instead of UTF-8, the default one should be US-ASCII or ISO-8859-1.

The problem comes from the comment line that follows the header line and contains at least 4 binary characters (code >= 128). As far as I remember, to match binary characters in JavaCC we must describe them using the Unicode notation (\uxxxx). With the charset CP1252, the byte <9F> can't match the token BINARY([\u0080-\u00FF]), because it is mapped to the Unicode character \u0178.
(See http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT)

So we have 3 possibilities:
[1] - Find a way to specify binary characters without Unicode notation in JavaCC
[2] - Add all the Unicode exceptions for Cp1252 to the BINARY token description
[3] - Update the BINARY token to [\u0080-\uFFFF] to avoid the specificities of other charsets.

I prefer the first one, but if we can't do it, the third one may be the best way to avoid further issues.
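The charset mismatch described above can be reproduced outside JavaCC. The following is only a minimal sketch, assuming the JRE provides the windows-1252 charset (the Java name for Cp1252); the class and method names are hypothetical and are not part of PDFBox or Preflight. It decodes the single byte <9F> with two charsets and checks the resulting code point against the current BINARY range and the wider range of option [3].

```java
import java.nio.charset.Charset;

public class BinaryByteDecoding {

    /** Decodes one byte with the named charset and returns the resulting code point. */
    static int decode(int byteValue, String charsetName) {
        byte[] raw = { (byte) byteValue };
        String s = new String(raw, Charset.forName(charsetName));
        return s.codePointAt(0);
    }

    /** True if the code point lies inside the inclusive range [low, high]. */
    static boolean inRange(int codePoint, int low, int high) {
        return codePoint >= low && codePoint <= high;
    }

    public static void main(String[] args) {
        int b = 0x9F;
        // ISO-8859-1 maps byte 0x9F to U+009F; windows-1252 maps it to U+0178.
        for (String cs : new String[] { "ISO-8859-1", "windows-1252" }) {
            int cp = decode(b, cs);
            System.out.printf("%-12s 0x%02X -> U+%04X  in [0080-00FF]: %b  in [0080-FFFF]: %b%n",
                    cs, b, cp,
                    inRange(cp, 0x0080, 0x00FF),   // current BINARY range
                    inRange(cp, 0x0080, 0xFFFF));  // range proposed in option [3]
        }
    }
}
```

With windows-1252 the decoded character falls outside [\u0080-\u00FF] but inside [\u0080-\uFFFF], which is consistent with the third option accepting the byte regardless of the charset used to read the file.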
With the third option, my whole test set runs successfully under the following encodings:
- US-ASCII
- Cp1252
- ISO-8859-1
- UTF-8

BR,
Eric

> Preflight reports "1.1 : Body Syntax error"
> -------------------------------------------
>
>                 Key: PDFBOX-1279
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1279
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Preflight
>   Affects Versions: 1.7.0
>         Environment: Win 7 64Bit, jre 1.6.31
>            Reporter: beat weisskopf
>            Priority: Minor
>             Fix For: 1.7.0
>
>         Attachments: input_pdf_a_lvl_a_libreoffice_352.pdf, pdfbox_1279_cs.patch
>
>
> Just tried the PDF/A Validation. It fails on the attached pdf with "1.1 :
> Body Syntax error". Adobe Preflight reports success for both pdf/a level a
> and pdf/a level b validation. PDF was created with plain LibreOffice 3.5.2
> (export as pdf, using pdf/a level a).