Re: PDFName.getName() returns escaped name?!

Craig Ringer Thu, 29 Mar 2012 17:24:35 -0700

On 03/30/2012 05:09 AM, J.Pietschmann wrote:

Am 29.03.2012 01:24, schrieb Craig Ringer:

I'd also like to have getEncodedName() return a byte[] not a
String, since an encoded PDF name isn't actually text data.

Sounds like a reasonable idea.

BTW, is there any reason Fop's PDF library uses java.lang.String when
working with sequences of PDF data bytes?

I'd chalk this up to "historical reasons", as usual. Fell free to
provide a patch which cleans this up.

J.Pietschmann

Here's how I'd like to rewrite PDFName; untested code as an example ofwhat I'm getting at. This is just a standalone file; a patch thatincorporates it into the main sources will be a lot more work that I'mholding off on until I know folks here agree with the approach.

In any case, after reading more of the PDF library I'm rethinking thewisdom of trying to make this change. The change its self is correct,but it'll be really hard to safely integrate into the rest of the PDFlibrary because of the difficulty of auditing every site to ensurenothing breaks. Java likes to call `toString' automatically in places,meaning that anywhere that doesn't use the proper PDFWritable outputmethods PDFName inherits will break by producing bad PDF data that mightbe quite hard to spot. I'd start by making PDFName.toString() throw (fortesting), but that'd only catch issues in code that test paths actually hit.

Given the number of these kinds of issues in fop's pdf library I'm moreand more inclined to wonder if it should just be replaced with PDFBox.It's *full* of text encoding issues, it crams 8-bit binary data into thelower 8 bits of Unicode strings, etc. Most of the classes that extendbasics like PDFDictionary act like the base class isn't public API andbreak if anyone else changes the dictionary in ways they don't expect,too; they should "have-a" PDFDictionary not "be-a" PDFDictionary really.

PDFBox is far from perfect, but it has a clean separation between themodel classes (PDxxxx) and the basic PDF data types (COSxxx); it has aclean PDFName, PDFString, etc; it has a good PDF parser already, etc.Maybe it'd be easier for me to whip up a port of FOP's PDF output codeto PDFBox? I suspect I'm insane to mention the possibility of doing thatwithout evaluating the amount of work involved first, so I'm notpromising anything, but by the looks it might be easier than doing thecleanups I'd like to do in fop.


Thoughts?

--
Craig Ringer

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* $Id$ */

package org.apache.fop.pdf;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.Serializable;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.util.*;
import org.apache.commons.io.output.CountingOutputStream;

/**
 * Class representing a PDF name object.
 * 
 */
public class PDFName extends PDFObject {
    

    private static final Map<ByteString, PDFName> commonNames;
    
    private final ByteString unescapedName;
    private ByteString escapedName;
    
    /**
     * Creates a new PDF name object from a Unicode java string,
     * encoding the name as UTF-8.
     * 
     * @param name the name value
     */
    public PDFName(String name) {
        super();
        this.unescapedName = new ByteString(name.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    }
    
    /**
     * Creates a new PDF name object from a sequence of bytes
     * in no particular encoding.
     * 
     * By PDF convention you should use utf-8 when encoding names
     * (as is done by the String-based PDFName constructor), but this
     * is NOT required by the spec.
     */
    private PDFName(ByteString name) {
        super();
        this.unescapedName = name;
    }
    
    /**
     * Create a PDFName with a pre-escaped name supplied. This is mostly useful
     * when defining names from data parsed from PDF data, or when allocating
     * pre-cached names.
     * 
     * @param unescapedName Name with PDF name escapes decoded
     * @param escapedName Name encoded with PDF escapes
     */
    private PDFName(ByteString unescapedName, ByteString escapedName) {
        this.unescapedName = unescapedName;
        this.escapedName = escapedName;
    }
    
    /**
     * Create a PDFName based on the bytes in `name'
     * 
     * @param name 
     * @return 
     */
    public static PDFName create(ByteString name) {
        PDFName n = commonNames.get(name);
        if (n == null) {
            n = new PDFName(name);
        }
        return n;
    }
    
    private static final List<Byte> ESCAPED_NAME_CHARS 
            = Arrays.asList(new Byte[]{'/', '(', ')', '<', '>', '[', ']', '%', '#'});

    private static final byte[] DIGITS
        = {'0', '1', '2', '3', '4', '5', '6', '7',
           '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
    
    /**
     * Perform PDF name escaping on a sequence of bytes.
     * 
     * It is recommended by the PDF specification that names be UTF-8 encoded
     * before escaping, at least for names that are intended to have a meaning
     * as text. You can pass any byte sequence that's legal as a name, but should
     * generally stick to UTF-8 to avoid issues with readers and PDF inspection
     * software.
     * 
     * FOP likes to shove Latin-1 data into the lower byte of each `char' in a
     * String, producing garbage text. Do <b>not</b> just call String.encode("UTF-8")
     * on such a string and pass it to this function if there's any chance it
     * has non-7-bit chars in it; unmangle it to a valid Unicode string first.
     * 
     * @param bytes Byte sequence to PDF name escape, preferably UTF-8 encoded text
     * @return PDF-name-escaped representation of input bytes
     */
    private static ByteString pdfNameEscapeBytes(ByteString bytes) {       
        // Most likely the name will be 7-bit chars so it won't
        // require any extra space. It's worth checking quickly so we can
        // just return the name unchanged rather than allocating a ByteArrayOutputStream,
        // copying the data, etc.
        boolean needsEscape = false;
        for (byte b : bytes) {
            if (b < 33 || b > 126 || ESCAPED_NAME_CHARS.indexOf(b) >= 0) {
                needsEscape = true;
                break;
            }
        }
        if (needsEscape) {
            ByteArrayOutputStream encoded = new ByteArrayOutputStream(bytes.size());
            for (byte b : bytes) {
                if (b < 33 || b > 126 || ESCAPED_NAME_CHARS.indexOf(b) >= 0) {
                    // Write the `#' for an escape sequence
                    encoded.write(35);
                    // then the hex-escaped byte
                    encoded.write(DIGITS[b >>> 4 & 0x0F]);
                    encoded.write(DIGITS[b & 0x0F]);
                } else {
                    encoded.write(b);
                }
            }
            bytes = ByteString.asWrapperOf(encoded.toByteArray());
        }
        return bytes;
    }
    
    /**
     * Get the name escaped according to PDF name escaping rules.
     * 
     * @return Escaped PDF name
     */
    public ByteString getEscapedName() {
        if (escapedName == null) {
            escapedName = pdfNameEscapeBytes(unescapedName);
        }
        return escapedName;
    }

    /** 
     * Get a String representation of the PDF name suitable for printing
     * and debug display.
     *
     * Don't use this method for writing PDF names to a PDF file; use
     * PDFName.output(...) .
     * 
     * The name is not decoded as any particular text encoding; the raw,
     * pdf-name-escaped form is output.
     * 
     * No leading slash is output; the slash isn't part of a name.
     */
    @Override
    public String toString() {
        // Why doesn't this name decode as UTF-8 and return the UTF-8 decoded
        // name? A couple of reasons, but they all boil down to the fact that
        // any attempt at decoding can fail or have to substitute non-decodable
        // chars in the input, leading to output that's not necessarily accurate
        // and not distinguishable from a different name. Consider:
        //
        // The char "Â¾" is utf-8 "\xc2\xbe" and latin-1 "\xbe". If we have the name
        //     SizeÂ¾
        // encoded in latin-1 in a PDF we won't be able to decode it as utf-8
        // (as per PDF-convention) and don't know what other encoding it's in.
        // That means we have to either produce some escaped-bytes form like
        // literally printing:
        //
        //     Size\xbe
        // or
        //     Size#BE
        //
        // ... or we have to use a substitute char like:
        //
        //     Size?
        //
        // Unfortunately, ALL THREE OF THE ABOVE ARE VALID, LEGAL NAMES that
        // could be produced by decoding UTF-8 encoded byte sequences.
        // In other words, presuming we printed PDF-name-style hex escapes
        // as a fallback, the two raw PDF-name-escaped names:
        //
        //     Size#35#BE
        // 
        // and
        //
        //     Size#BE
        //
        // would both print as "Size#BE" - the first because it really is the
        // name "Size#BE" with the "#" escaped, and the second because UTF-8
        // decoding of "Size"+'\xBE' failed so we fell back to printing the raw
        // escaped form.
        //
        // The same problem arises with any other escaping scheme, so the only
        // sane thing to do since the PDF spec permits non-utf-8 names is print
        // the raw name with PDF escaping.
        //
        // Because of PDF escaping, our string is always US ASCII compatible, so
        // we can just encode it.
        //
        return new String(getEscapedName().getWrappedArray(), java.nio.charset.StandardCharsets.US_ASCII);
        
    }

    /**
     * Returns the non-escaped name, with no leading slash.
     * 
     * By PDF convention this is a sequence of UTF-8 encoded bytes, but there
     * is no guarantee that a PDF producer will use UTF-8 and the same text
     * encoded two different ways <i>forms two different names</i> in PDF.
     * 
     * The returned string is not necessarily valid UTF-8 if this name was
     * created by a plugin or 3rd party tool.
     *
     * @return the non-escaped name without the leading slash
     */
    public ByteString getName() {
        return escapedName;
    }
    
    /** {@inheritDoc} */
    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof PDFName)) {
            return false;
        }
        PDFName other = (PDFName)obj;
        return unescapedName.equals(other.unescapedName);
    }

    /** {@inheritDoc} */
    @Override
    public int hashCode() {
        return unescapedName.hashCode();
    }

    @Override
    public int output(OutputStream stream) throws IOException {
        CountingOutputStream cout = new CountingOutputStream(stream);
        StringBuilder textBuffer = new StringBuilder(64);
        textBuffer.append(toString());
        PDFDocument.flushTextBuffer(textBuffer, cout);
        return cout.getCount();
    }

    @Override
    public void outputInline(OutputStream out, StringBuilder textBuffer) throws IOException {
        if (hasObjectNumber()) {
            textBuffer.append(referencePDF());
        } else {
            textBuffer.append(toString());
        }
    }
    
    static {
        // The set of names used in the PDF standard. This set need not be exhaustive
        // as it is really just a memory use optimisation.
        //
        // Names in this list MUST be US-ASCII without spaces or other
        // chars that require PDF escaping, because they're assumed to be
        // valid as both unescaped and pre-escaped names.
        //
        String[] stdNames = new String[]{
            "A", "AA", "AcroForm", "AIS", "Annot", "Annots", "AntiAlias", 
            "APRef", "AS", "ASCII85Decode", "A85", "Ascent", "ASCIIHexDecode", 
            "AHx", "AP", "Author", "AvgWidth", "B", "Background", "BaseEncoding", 
            "BaseFont", "BaseState", "BBox", "BlackIs1", "BlackPoint", "Catalog", 
            "C0", "C1", "CA", "ca", "CalGray", "CalRGB", "CapHeight", 
            "CCITTFaxDecode", "CCF", "CenterWindow", "CF", "CFM", "CharProcs", 
            "CharSet", "CIDFontType0", "CIDFontType2", "CIDSystemInfo", 
            "CIDToGIDMap", "Colorants", "Colors", "ColorSpace", "Columns", 
            "Contents", "Coords", "Count", "ClrF", "ClrFf", "CreationDate", 
            "Creator", "Crypt", "D", "DA", "DCTDecode", "DCT", "Decode", 
            "DecodeParms", "Dests", "DeviceCMYK", "DeviceGray", "DeviceN", 
            "DeviceRGB", "Differences", "Direction", "DisplayDocTitle", "DL", 
            "DocChecksum", "Domain", "DP", "DR", "Duplex", "DV", "DW", 
            "EmbeddedFiles", "Encode", "Encoding", "90ms-RKSJ-H", "90ms-RKSJ-V", 
            "ETen?B5?H", "ETen?B5?V", "Encrypt", "EncryptMetadata", 
            "ExtGState", "Extend", "Extends", "F", "FDecodeParms", "FFilter", 
            "Ff", "Fields", "Filter", "FirstChar", "FitWindow", "FL", "Flags", 
            "FlateDecode", "Fl", "Font", "FontBBox", "Form", "FormType", "FRM", 
            "FT", "Function", "FunctionType", "Functions", "Gamma", "H", "Height",
            "HideMenubar", "HideToolbar", "HideWindowUI", "ICCBased", "Identity", 
            "Identity-H", "Image", "ImageMask", "Index", "Indexed", "Info", 
            "ItalicAngle", "JavaScript", "JBIG2Decode", "JPXDecode", "Keywords", 
            "Kids", "Lab", "LastChar", "LastModified", "LC", "Leading", "Length", 
            "Length1", "Limits", "LJ", "LW", "LZWDecode", "LZW", "M", 
            "MacRomanEncoding", "Mask", "Matrix", "MaxWidth", "MissingWidth", 
            "ML", "N", "Name", "Names", "Next", "NM", "NonFullScreenPageMode", 
            "Nums", "ObjStm", "OP", "op", "OPM", "Opt", "Order", "Ordering", 
            "P", "Page", "Pages", "PaintType", "Parent", "Pattern", "PatternType", 
            "PDFDocEncoding", "Predictor", "Prev", "PrintArea", "PrintClip", 
            "PrintScaling", "ProcSet", "Producer", "Properties", "Q", "R", 
            "Range", "Recipients", "Rect", "Registry", "Resources", "RI", "Root",
            "RunLengthDecode", "RL", "RV", "SA", "SE", "Separation", "SetF", "SetFf",
            "Shading", "ShadingType", "SM", "SMask", "Size", "StandardEncoding", 
            "StdCF", "StemH", "StemV", "StmF", "StrF", "SubFilter", "Subj", 
            "Subject", "Supplement", "Subtype", "TilingType", "Title", "TK", 
            "ToUnicode", "Type", "U", "V", "Version", "VerticesPerRow", "W", 
            "Width", "Widths", "WinAnsiEncoding", "WhitePoint", "XHeight", 
            "XObject", "XRef", "XStep", "YStep"
        };
        commonNames = new HashMap<ByteString,PDFName>();
        for (String name : stdNames) {
            ByteString nameBytes = ByteString.asWrapperOf(name.getBytes(java.nio.charset.StandardCharsets.US_ASCII));            
            commonNames.put(nameBytes, new PDFName(nameBytes, nameBytes));
        }
    }
    
    
    /**
     * Immutable byte string class. This really should be one of the existing
     * well-known byte string classes that supports interning small values,
     * etc; this is just a minimal example to show the approach I'm talking
     * about.
     */
    public static class ByteString implements Serializable, PDFWritable, Iterable<Byte> {
        byte[] data;
        
        private ByteString(byte[] data) {
            if (data == null) {
                throw new NullPointerException("ByteString can't wrap a null byte array");
            }
            this.data = data;
        }
        
        /**
         * Wrap `data' in a ByteString without copying it.
         * 
         * @warning the caller MUST NOT ALTER THE WRAPPED BYTE ARRAY. This method
         *          is an optimisation that's potentially dangerous and should
         *          only be used where the byte array passed is provably not
         *          altered - usually where the ByteString is short lived or the
         *          array is soon discarded.
         * 
         * @param data  Byte array to wrap without copying
         * @return ByteArray wrapping `data'
         */
        public static ByteString asWrapperOf(byte[] data) {
            return new ByteString(data);
        }
        
        /**
         * Copy `data' into a new ByteString
         * 
         * No refernce to `data' is retained, and it may be safely altered
         * after this method returns.
         * 
         * @param data Byte sequence to create the ByteString with
         * @return ByteString containing a copy of `data'
         */
        public static ByteString asCopyOf(byte[] data) {
            return new ByteString(Arrays.copyOf(data, data.length));
        }
        
        public void outputInline(OutputStream out, StringBuilder textBuffer) throws IOException {
            PDFDocument.flushTextBuffer(textBuffer, out);
            out.write(data);
        }
        
        /**
         * Produce a human readable represenation of the binary data array
         * (same as Arrays.toString(byte[]) produces).
         * 
         * This is ***NOT*** a Java string with the bytes crammed into it.
         * 
         * @return Human readable representation of binary byte array
         */
        @Override
        public String toString() {
            // Human-readable representation of the binary data
            return Arrays.toString(data);
        }
        
        /**
         * Get access to the underlying byte array
         * 
         * @warning Modifying the underlying array is <b>incorrect</b> and leads
         *          to all sorts of horrid results. Don't do it! This method is
         *          only offered as an optimisation for code that knows it can never
         *          modify the returned array.
         * 
         * @return Byte array this ByteString is based on
         */
        public byte[] getWrappedArray() {
            return data;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj == null) {
                return false;
            }
            if (getClass() != obj.getClass()) {
                return false;
            }
            final ByteString other = (ByteString) obj;
            if (!Arrays.equals(this.data, other.data)) {
                return false;
            }
            return true;
        }

        @Override
        public int hashCode() {
            int hash = 7;
            hash = 29 * hash + Arrays.hashCode(this.data);
            return hash;
        }

        public Iterator<Byte> iterator() {
            return new ByteStringIterator();
        }

        public int size() {
            return data.length;
        }
        
        public class ByteStringIterator implements Iterator<Byte> {
            
            int pos = 0;

            public boolean hasNext() {
                return pos >= data.length;
            }

            public Byte next() {
                return data[pos++];
            }

            public void remove() {
                throw new UnsupportedOperationException("ByteString is immutable");
            }
            
        }
        
    }
    
    
}

Re: PDFName.getName() returns escaped name?!

Reply via email to