Carlos (I believe) pointed me at a document with a form-editing bug, at http://bugzilla.gnome.org/show_bug.cgi?id=365807. The text in the upper-right corner is actually a multi-line form field. If you click on that text, only the first line is made available for editing. However, editing the field to include additional lines still works.

The problem has to do with the conversion of strings from PDFDocEncoding to Unicode. The value of the field is initially stored in PDFDocEncoding. The lookup table for the conversion does not know what to do with a carriage return, and so maps it to U+0000. When the converted string is passed up to evince for editing, the null character ends it early, at the first newline. When we edit the field, we store the result back as a Unicode string, so the value no longer goes through the PDFDocEncoding table; that is why newly entered lines survive.
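
To make the failure concrete, here is a minimal sketch of the truncation, not poppler's actual code. The names oldLookup, convertWithOldTable and lengthAsCString are hypothetical, and oldLookup only imitates the old table for ASCII input (printable ASCII maps to itself, everything else to U+0000).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for the old table: printable ASCII maps to itself,
// every other byte (including carriage return, 0x0d) maps to U+0000.
static uint32_t oldLookup(unsigned char byte) {
    return (byte >= 0x20 && byte < 0x7f) ? byte : 0x0000;
}

// Minimal UTF-8 encoder for code points below U+10000 (enough for this sketch).
static void appendUtf8(std::string &out, uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xc0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else {
        out += static_cast<char>(0xe0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    }
}

// Convert a PDFDocEncoding byte string to UTF-8 using the old mapping.
std::string convertWithOldTable(const std::string &pdfDoc) {
    std::string utf8;
    for (unsigned char b : pdfDoc) {
        appendUtf8(utf8, oldLookup(b));
    }
    return utf8;
}

// Length seen by a consumer that treats the buffer as a null-terminated string.
std::size_t lengthAsCString(const std::string &utf8) {
    return std::strlen(utf8.c_str());
}
```

For the input "line one\rline two", convertWithOldTable returns 17 bytes, but lengthAsCString reports 8: the carriage return became a null byte, so a consumer of the C string sees only "line one".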

The fix: add carriage return and a few other characters to the PDFDocEncoding table, mapping each to the Unicode character with the same numeric value. In this patch, I'm only adding mappings for whitespace characters (tab, line feed, form feed, carriage return), not all control characters. I contemplated adding mappings for all control characters, but it's not possible to do a complete job, since some bytes below 0x20 (0x18 through 0x1f) are already used for glyphs.

While making this change, I also updated the table so that any unknown characters are now mapped to U+FFFD (conventionally used to represent a character that couldn't be converted) instead of U+0000. This should prevent an unknown character in a PDFDocEncoding string from being turned into a null in the future.
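
The patched behavior for the low half of the table can be sketched as a function. This is only an illustration: patchedLookup7bit and decodeLow are hypothetical names, the real code is a 256-entry array, and bytes 0x80 and above are outside the scope of this sketch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

constexpr uint32_t kReplacement = 0xfffd; // U+FFFD REPLACEMENT CHARACTER

// Sketch of the patched mapping for bytes 0x00..0x7f.
uint32_t patchedLookup7bit(unsigned char byte) {
    // Accent glyphs that PDFDocEncoding already assigns to 0x18..0x1f.
    static const uint32_t accents[8] = {0x02d8, 0x02c7, 0x02c6, 0x02d9,
                                        0x02dd, 0x02db, 0x02da, 0x02dc};
    if (byte == 0x00) return 0x0000; // byte 0 itself still maps to U+0000
    if (byte == 0x09 || byte == 0x0a || byte == 0x0c || byte == 0x0d) {
        return byte;                 // tab, LF, FF, CR keep their numeric value
    }
    if (byte < 0x18) return kReplacement;        // other control characters
    if (byte < 0x20) return accents[byte - 0x18];
    if (byte < 0x7f) return byte;                // printable ASCII
    // 0x7f is undefined. Bytes >= 0x80 are NOT modeled here; returning
    // U+FFFD for them is a placeholder, not what the real table does.
    return kReplacement;
}

// Decode a byte string (assumed to contain only bytes below 0x80).
std::vector<uint32_t> decodeLow(const std::string &pdfDoc) {
    std::vector<uint32_t> out;
    for (unsigned char b : pdfDoc) {
        out.push_back(patchedLookup7bit(b));
    }
    return out;
}
```

With this mapping, decoding "a\rb" yields U+0061, U+000D, U+0062, so no null appears in the middle of the string, and an undefined byte such as 0x01 comes out as U+FFFD instead of U+0000.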

--Michael Vrable

Provide Unicode mappings for some control characters in PDFDocEncoding.

Though they do not represent glyphs, values such as carriage return can be
found in text strings in PDFDocEncoding.  Provide mappings for these bytes
to Unicode.

Additionally, map unknown characters to U+FFFD instead of U+0000, so that
unknown characters do not result in nulls (which can truncate strings
early, particularly if the string is later re-encoded into null-terminated
UTF-8).
---
 poppler/PDFDocEncoding.cc |   23 +++++++++++++++++------
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/poppler/PDFDocEncoding.cc b/poppler/PDFDocEncoding.cc
index 89dc382..1d3ea74 100644
--- a/poppler/PDFDocEncoding.cc
+++ b/poppler/PDFDocEncoding.cc
@@ -8,10 +8,21 @@
 
 #include "PDFDocEncoding.h"
 
+// Mapping of PDFDocEncoding (used to represent text values such as document
+// metadata or annotation text) to Unicode codepoints.  Not all 8-bit values in
+// PDFDocEncoding are defined; undefined bytes are mapped to U+FFFD (Unicode
+// replacement character).
+//
+// PDFDocEncoding is only directly defined for printable characters, but some
+// control characters such as carriage return will still be used.  We define
+// mappings of the standard whitespace control characters (tabs, newlines) to
+// the corresponding Unicode values.  Other control characters are left
+// undefined.
+
 Unicode pdfDocEncoding[256] = {
-  0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, // 00
-  0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
-  0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, // 10
+  0x0000, 0xfffd, 0xfffd, 0xfffd, 0xfffd, 0xfffd, 0xfffd, 0xfffd, // 00
+  0xfffd, 0x0009, 0x000a, 0xfffd, 0x000c, 0x000d, 0xfffd, 0xfffd,
+  0xfffd, 0xfffd, 0xfffd, 0xfffd, 0xfffd, 0xfffd, 0xfffd, 0xfffd, // 10
   0x02d8, 0x02c7, 0x02c6, 0x02d9, 0x02dd, 0x02db, 0x02da, 0x02dc,
   0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, // 20
   0x0028, 0x0029, 0x002a, 0x002b, 0x002c, 0x002d, 0x002e, 0x002f,
@@ -24,13 +35,13 @@ Unicode pdfDocEncoding[256] = {
   0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, // 60
   0x0068, 0x0069, 0x006a, 0x006b, 0x006c, 0x006d, 0x006e, 0x006f,
   0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, // 70
-  0x0078, 0x0079, 0x007a, 0x007b, 0x007c, 0x007d, 0x007e, 0x0000,
+  0x0078, 0x0079, 0x007a, 0x007b, 0x007c, 0x007d, 0x007e, 0xfffd,
   0x2022, 0x2020, 0x2021, 0x2026, 0x2014, 0x2013, 0x0192, 0x2044, // 80
   0x2039, 0x203a, 0x2212, 0x2030, 0x201e, 0x201c, 0x201d, 0x2018,
   0x2019, 0x201a, 0x2122, 0xfb01, 0xfb02, 0x0141, 0x0152, 0x0160, // 90
-  0x0178, 0x017d, 0x0131, 0x0142, 0x0153, 0x0161, 0x017e, 0x0000,
+  0x0178, 0x017d, 0x0131, 0x0142, 0x0153, 0x0161, 0x017e, 0xfffd,
   0x20ac, 0x00a1, 0x00a2, 0x00a3, 0x00a4, 0x00a5, 0x00a6, 0x00a7, // a0
-  0x00a8, 0x00a9, 0x00aa, 0x00ab, 0x00ac, 0x0000, 0x00ae, 0x00af,
+  0x00a8, 0x00a9, 0x00aa, 0x00ab, 0x00ac, 0xfffd, 0x00ae, 0x00af,
   0x00b0, 0x00b1, 0x00b2, 0x00b3, 0x00b4, 0x00b5, 0x00b6, 0x00b7, // b0
   0x00b8, 0x00b9, 0x00ba, 0x00bb, 0x00bc, 0x00bd, 0x00be, 0x00bf,
   0x00c0, 0x00c1, 0x00c2, 0x00c3, 0x00c4, 0x00c5, 0x00c6, 0x00c7, // c0
