Package: pdftk-java
Version: 3.3.3-2
Control: affects -1 exiftool

When updating the Info-dictionary date fields, pdftk-java encodes the date
string in UTF-16BE with BOM instead of ASCII (or PDFDocEncoding). This causes
an interoperability issue with exiftool, which then prints the raw date string
instead of normalizing it into a human-readable form.

Grab a sample PDF file, say, https://pdfobject.com/pdf/sample.pdf, and try to
update the creation date there with either update_info or update_info_utf8
(either one works for our purposes):

$ pdftk sample.pdf update_info \
    <(echo -e "InfoBegin\nInfoKey: CreationDate\nInfoValue: D:199812231952-08'00'") \
    output sample_with_date.pdf

Exiftool shows what's there, but the creation date is not in a human-readable 
form, whereas the original modification date is way more readable:

$ exiftool -a -G sample_with_date.pdf | grep "PDF.*Date"
[PDF]           Modify Date                     : 2008:07:01 05:24:47Z
[PDF]           Create Date                     : D:199812231952-08'00'

The culprit is the encoding of the date in UTF-16BE, starting with the 
byte-order mark FE FF:

$ mutool show sample_with_date.pdf trailer/Info | grep Date
  /ModDate (D:20080701052447Z00'00')
  /CreationDate <FEFF0044003A003100390039003800310032003200330031003900350032002D003000380027003000300027>
$ xxd sample_with_date.pdf | grep -A3 "Dat"
000045c0: 2028 5061 6765 7329 0a2f 4d6f 6444 6174   (Pages)./ModDat
000045d0: 6520 2844 3a32 3030 3830 3730 3130 3532  e (D:20080701052
000045e0: 3434 375a 3030 2730 3027 290a 2f43 7265  447Z00'00')./Cre
000045f0: 6174 696f 6e44 6174 6520 28fe ff00 4400  ationDate (...D.
00004600: 3a00 3100 3900 3900 3800 3100 3200 3200  :.1.9.9.8.1.2.2.
00004610: 3300 3100 3900 3500 3200 2d00 3000 3800  3.1.9.5.2.-.0.8.
00004620: 2700 3000 3000 2729 0a2f 5072 6f64 7563  '.0.0.')./Produc
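
To confirm that the hex string printed by mutool is nothing more than the
date in UTF-16BE, it can be decoded by hand. The following is a minimal
sketch in POSIX shell; decode_utf16be_hex is a hypothetical helper name (not
part of pdftk, mutool or exiftool), and it only accepts printable ASCII code
units, which should be all that a PDF date string needs:

```shell
#!/bin/sh
# Decode a PDF hex string such as <FEFF0044...> that holds UTF-16BE text.
# Sketch only: accepts printable ASCII code units (U+0020..U+007E) and
# fails on anything else. decode_utf16be_hex is a hypothetical helper.
decode_utf16be_hex() (
    hex=${1#"<"}        # strip the optional leading '<'
    hex=${hex%">"}      # strip the optional trailing '>'
    case $hex in
        [Ff][Ee][Ff][Ff]*) hex=${hex#????} ;;   # drop the FE FF byte-order mark
        *) echo "no UTF-16BE byte-order mark" >&2; exit 1 ;;
    esac
    out=
    while [ -n "$hex" ]; do
        rest=${hex#????}        # everything after the next four hex digits
        unit=${hex%"$rest"}     # the next UTF-16 code unit, e.g. 0044
        hex=$rest
        # Reject truncated input and non-hex characters.
        [ "${#unit}" -eq 4 ] || { echo "truncated code unit" >&2; exit 1; }
        case $unit in
            *[!0-9A-Fa-f]*) echo "not a hex string" >&2; exit 1 ;;
        esac
        code=$((0x$unit))
        if [ "$code" -lt 32 ] || [ "$code" -gt 126 ]; then
            echo "code unit $unit is not printable ASCII" >&2
            exit 1
        fi
        # Turn the code into an octal escape and let printf emit the byte.
        out=$out$(printf "\\$(printf '%03o' "$code")")
    done
    printf '%s\n' "$out"
)

decode_utf16be_hex '<FEFF0044003A003100390039003800310032003200330031003900350032002D003000380027003000300027>'
```

Run on the /CreationDate value above, this prints D:199812231952-08'00', the
same ASCII string that was passed to update_info.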

Instead of UTF-16BE encoding beginning with FE FF, CreationDate should be 
written as an ASCII string:
/CreationDate (D:199812231952-08'00')

In the PDF spec 1.3,
https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.3.pdf,
Table 3.21, a date is a string, and a string is specified at the beginning of
§ 3.2.3 as a series of bytes (unsigned integer values in the range 0 to 255).

In the PDF spec 1.7,
https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf,
Table 34, a date is an ASCII string.

In the PDF spec 2.0,
https://developer.adobe.com/document-services/docs/assets/5b15559b96303194340b99820d3a70fa/PDF_ISO_32000-2.pdf,
Table 35, a date is also an ASCII string.

Although § 7.9.4 in specs 1.7 and 2.0 says that a date is a text string, text
strings can be PDFDocEncoded, and PDFDocEncoding contains printable ASCII, so
an ASCII date string is consistent with both definitions.

The mutool output above shows inconsistent encoding of date fields within the
same Info dictionary, and the exiftool output shows the reduced
interoperability when UTF-16BE is used.

Requested change: pdftk-java should write Info date fields using printable 
ASCII (PDFDocEncoding subset) instead of UTF-16BE. Since the PDF date format 
uses only printable ASCII characters, UTF-16BE encoding is unnecessary and 
results in inconsistent encoding and reduced interoperability with some PDF 
tools.
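
To illustrate that the date format needs nothing beyond printable ASCII, here
is a loose syntax check written as a shell sketch. is_pdf_date is a
hypothetical helper, and the pattern is only an approximation of the
D:YYYYMMDDHHmmSSOHH'mm' layout described in the specs: it does not
range-check the individual fields.

```shell
#!/bin/sh
# Loose syntax check for a PDF date string such as D:199812231952-08'00'.
# is_pdf_date is a hypothetical helper; the pattern approximates the
# D:YYYYMMDDHHmmSSOHH'mm' layout and does not validate field ranges.
is_pdf_date() {
    # Only 'D', ':', digits, '+', '-', 'Z' and the apostrophe are allowed.
    re="^D:[0-9]{4}([0-9]{2}){0,5}([+Z-]([0-9]{2}'([0-9]{2}'?)?)?)?\$"
    # LC_ALL=C keeps [0-9] limited to the ASCII digits.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq "$re"
}

if is_pdf_date "D:199812231952-08'00'"; then
    echo "looks like a PDF date"
fi
```

Both date strings from the sample file, D:199812231952-08'00' and
D:20080701052447Z00'00', pass this check in their ASCII form.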

This appears to be an upstream pdftk-java behavior rather than a Debian 
packaging issue.
