Re: Review Request: 6202130: java.util.jar.Attributes.writeMain() can't handle multi-byte chars

Philipp Kunz Mon, 22 Oct 2018 14:57:56 -0700

Please find another revised and updated patch attached.


On Sun, 2018-10-21 at 15:10 -0700, Martin Buchholz wrote:
> I only took a quick look.
> Looks good, but here's a nitpick - capitalize javadoc that begins
> with "returns"
> On Fri, Oct 19, 2018 at 1:27 AM, Philipp Kunz <[email protected]
> h> wrote:
> > Hi Martin and everyone,
> > 
> > 
> > 
> > You were absolutely right to object "utf".
> > 
> > Please find a revised and updated patch attached.
> > 
> > 
> > 
> > Regards,
> > 
> > Philipp
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > On Thu, 2018-10-04 at 15:24 -0700, Martin Buchholz wrote:
> > 
> > > "utf8" is a thing.  "utf" is not.
> > 
> > > 
> > 
> > > On Thu, Oct 4, 2018 at 12:58 PM, Philipp Kunz <philipp.kunz@parat
> > ix.c
> > 
> > > h> wrote:
> > 
> > > > Thanks to Sherman's hint, I revised the test to use the terms
> > 
> > > > unicode
> > 
> > > > character and utf encoding consistently and not utf character.
> > 
> > > > Affects
> > 
> > > > only the test, mostly in comments and a few identifiers.
> > 
> > > > 
> > 
> > > > Philipp
> > 
> > > > 
> > 
> > > > 
> > 
> > > > On Wed, 2018-10-03 at 12:48 -0700, Xueming Shen wrote:
> > 
> > > > > +1
> > 
> > > > > 
> > 
> > > > > only nit is it might be preferred to use "utf8" directly in
> > the
> > 
> > > > > test
> > 
> > > > > instead of current "utf"
> > 
> > > > > 
> > 
> > > > > -sherman
> > 
> > > > > 
> > 
> > > > > On 10/03/2018 11:55 AM, Philipp Kunz wrote:
> > 
> > > > > > In short, bug 6202130 is about removal of comments in
> > 
> > > > > > Attributes
> > 
> > > > > > like
> > 
> > > > > > 
> > 
> > > > > > > XXX Need to handle UTF8 values and break up lines longer
> > than
> > 
> > > > > > > 72
> > 
> > > > > > 
> > 
> > > > > > The patch here contains a test that shows that utf
> > characters
> > 
> > > > > > are
> > 
> > > > > > handled correctly, which has been the case before. Breaking
> > 
> > > > > > lines
> > 
> > > > > > has
> > 
> > > > > > also been working before.
> > 
> > > > > > 
> > 
> > > > > > In the main code, the patch only removes a few comments and
> > 
> > > > > > does
> > 
> > > > > > not
> > 
> > > > > > change actual code. It all has been working before and now
> > we
> > 
> > > > > > know.
> > 
> > > > > > 
> > 
> > > > > > Philipp
> > 
> > > > > > 
> > 
> > > > > > 
> > 
> > > > > > https://bugs.openjdk.java.net/browse/JDK-6202130
> > 
> > > > > > 
> > 
> > > > > > 
> > 
> > > > > > 
> > 
> > > > > > 
> > 
> > > > > > On Wed, 2018-09-26 at 07:08 +0200, Philipp Kunz wrote:
> > 
> > > > > > > Hi,
> > 
> > > > > > > 
> > 
> > > > > > > There are some comments in the source code like this:
> > 
> > > > > > > 
> > 
> > > > > > >   >  XXX Need to handle UTF8 values and break up lines
> > longer
> > 
> > > > > > > than 72 bytes
> > 
> > > > > > > 
> > 
> > > > > > > in Attributes.java on lines 299, 327, and 370. Bug
> > 6202130
> > 
> > > > > > > also
> > 
> > > > > > > suggests
> > 
> > > > > > > to add a test to show that the character set is fully
> > 
> > > > > > > supported
> > 
> > > > > > > before
> > 
> > > > > > > actually removing those comments.
> > 
> > > > > > > 
> > 
> > > > > > > Hence, I added such a test, see ValueUtfEncoding in
> > attached
> > 
> > > > > > > patch, and
> > 
> > > > > > > it finds the complete utf character set supported
> > correctly
> > 
> > > > > > > and
> > 
> > > > > > > that it
> > 
> > > > > > > has been so already before.
> > 
> > > > > > > 
> > 
> > > > > > > The comments in Attributes also say that lines longer
> > than 72
> > 
> > > > > > > bytes need
> > 
> > > > > > > to be broken. That is a different issue and has been
> > already
> > 
> > > > > > > addressed
> > 
> > > > > > > and solved in bug 6372077 and is now tested in
> > 
> > > > > > > LineBreakLineWidth
> > 
> > > > > > > and
> > 
> > > > > > > therefore, that line break part of the comment is
> > obsolete
> > 
> > > > > > > now as
> > 
> > > > > > > well
> > 
> > > > > > > with the exception of version headers ("Manifest-Version" 
> > or
> > 
> > > > > > > "Signature-Version") at least when taking strictly.
> > 
> > > > > > > 
> > 
> > > > > > > Version headers are not subject of bug 6202130 for which
> > my
> > 
> > > > > > > patch
> > 
> > > > > > > intends to provide a fix and if only because multi-byte
> > 
> > > > > > > characters are
> > 
> > > > > > > not allowed in version headers and wasn't subject of
> > 6372077
> > 
> > > > > > > either
> > 
> > > > > > > which was about too short lines rather than too long
> > ones. If
> > 
> > > > > > > I
> > 
> > > > > > > would
> > 
> > > > > > > change only the minimum I could argue was safe, there
> > would
> > 
> > > > > > > have
> > 
> > > > > > > to
> > 
> > > > > > > remain a comment in Attributes.java on line 327 like
> > 
> > > > > > > 
> > 
> > > > > > >   >  XXX Need to handle version headers longer than 72
> > bytes
> > 
> > > > > > > 
> > 
> > > > > > > but I guess that should probably better be represented as
> > 
> > > > > > > another
> > 
> > > > > > > bug
> > 
> > > > > > > rather than this style of comment if at all desired.
> > There
> > 
> > > > > > > are
> > 
> > > > > > > some bugs
> > 
> > > > > > > concerning version headers and particularly the behavior
> > of
> > 
> > > > > > > the
> > 
> > > > > > > manifests in their absence, among possibly others,
> > 6910466,
> > 
> > > > > > > 4935610,
> > 
> > > > > > > 4271239, 8196371 and 4256580 which have some relation but
> > I
> > 
> > > > > > > haven't
> > 
> > > > > > > found one that really fits the 72 byte line length limit
> > of
> > 
> > > > > > > version
> > 
> > > > > > > headers. Whether such a bug is created or found or not,
> > it
> > 
> > > > > > > does
> > 
> > > > > > > not
> > 
> > > > > > > directly relate to the one I'm trying to fix here,
> > 6202130.
> > 
> > > > > > > 
> > 
> > > > > > > If also the test is good I would assume that the comments
> > can
> > 
> > > > > > > be
> > 
> > > > > > > removed now without actually changing the Manifest and
> > 
> > > > > > > Attributes
> > 
> > > > > > > implementation.
> > 
> > > > > > > 
> > 
> > > > > > > Regards,
> > 
> > > > > > > Philipp
> 
>

diff -r 5e3a8f387701 src/java.base/share/classes/java/util/jar/Attributes.java
--- a/src/java.base/share/classes/java/util/jar/Attributes.java	Mon Oct 22 13:31:42 2018 -0700
+++ b/src/java.base/share/classes/java/util/jar/Attributes.java	Mon Oct 22 23:54:42 2018 +0200
@@ -296,7 +296,6 @@
 
     /*
      * Writes the current attributes to the specified data output stream.
-     * XXX Need to handle UTF8 values and break up lines longer than 72 bytes
      */
      @SuppressWarnings("deprecation")
      void write(DataOutputStream os) throws IOException {
@@ -323,8 +322,6 @@
      * Writes the current attributes to the specified data output stream,
      * make sure to write out the MANIFEST_VERSION or SIGNATURE_VERSION
      * attributes first.
-     *
-     * XXX Need to handle UTF8 values and break up lines longer than 72 bytes
      */
     @SuppressWarnings("deprecation")
     void writeMain(DataOutputStream out) throws IOException
@@ -367,7 +364,6 @@
 
     /*
      * Reads attributes from the specified input stream.
-     * XXX Need to handle UTF8 values.
      */
     void read(Manifest.FastInputStream is, byte[] lbuf) throws IOException {
         read(is, lbuf, null, 0);
diff -r 5e3a8f387701 test/jdk/java/util/jar/Manifest/ValueUtf8Coding.java
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test/jdk/java/util/jar/Manifest/ValueUtf8Coding.java	Mon Oct 22 23:54:42 2018 +0200
@@ -0,0 +1,202 @@
+/*
+ * Copyright (c) 2018, Oracle and/or its affiliates. All rights reserved.
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This code is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 only, as
+ * published by the Free Software Foundation.
+ *
+ * This code is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ * version 2 for more details (a copy is included in the LICENSE file that
+ * accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License version
+ * 2 along with this work; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
+ * or visit www.oracle.com if you need additional information or have any
+ * questions.
+ */
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.jar.Attributes;
+import java.util.jar.Attributes.Name;
+import java.util.jar.Manifest;
+import java.util.List;
+import java.util.ArrayList;
+
+import org.testng.annotations.Test;
+import static org.testng.Assert.*;
+
+/**
+ * @test
+ * @bug 6202130
+ * @run testng ValueUtf8Coding
+ * @summary Tests that UTF-8 encoding and decoding manifest header values
+ * works well with the complete unicode character set.
+ */
+public class ValueUtf8Coding {
+
+    /**
+     * Maximum number of bytes of UTF-8 encoded characters in one header value.
+     * <p>
+     * There are too many unicode characters (code points) to fit all into one
+     * manifest value. The specifications state:
+     * <q>Implementations should support 65535-byte (not character) header
+     * values, and 65535 headers per file. They might run out of memory,
+     * but there should not be hard-coded limits below these values.</q>
+     *
+     * @see <a
+     * href="{@docRoot}/../specs/jar/jar.html#Notes_on_Manifest_and_Signature_Files">
+     * Notes on Manifest and Signature Files</a>
+     */
+    static final int SUPPORTED_VALUE_LENGTH = 65535;
+
+    /**
+     * Returns {@code true} if {@code codePoint} is known not to be a supported
+     * character in manifest header values. Explicitly forbidden in manifest
+     * header values are according to a statement from the specifications:
+     * <q>otherchar: any UTF-8 character except NUL, CR and LF</q>.
+     * {@code NUL} ({@code 0x0}), however, works just fine and might have been
+     * used and might still be.
+     *
+     * @see <a href="{@docRoot}/../specs/jar/jar.html#Section-Specification">
+     * Jar File Specification</a>
+     */
+    static boolean isUnsupportedManifestValueCharacter(int codePoint) {
+        return codePoint == '\r' /* CR */ || codePoint == '\n' /* LF */;
+    };
+
+    /**
+     * Produces a list of strings with all unicode characters except those
+     * explicitly invalid in manifest header values.
+     * Each string is filled with as many characters as fit into
+     * {@link #SUPPORTED_VALUE_LENGTH} bytes with UTF-8 encoding except the
+     * last string which contains the remaining characters.
+     */
+    static List<String> produceUnicodeCharacterValues() {
+        ArrayList<String> values = new ArrayList<>();
+        byte[] valueBuf = new byte[SUPPORTED_VALUE_LENGTH];
+        int pos = 0;
+        for (int codePoint = Character.MIN_CODE_POINT;
+                codePoint <= Character.MAX_CODE_POINT; codePoint++) {
+            if (isUnsupportedManifestValueCharacter(codePoint)) {
+                continue;
+            }
+
+            byte[] charBuf = Character.toString(codePoint).getBytes(UTF_8);
+            if (pos + charBuf.length > valueBuf.length) {
+                values.add(new String(valueBuf, 0, pos, UTF_8));
+                pos = 0;
+            }
+            System.arraycopy(charBuf, 0, valueBuf, pos, charBuf.length);
+            pos += charBuf.length;
+        }
+        if (pos > 0) {
+            values.add(new String(valueBuf, 0, pos, UTF_8));
+        }
+        return values;
+    }
+
+    /**
+     * Returns simple, valid, short, and distinct manifest header names.
+     * The returned name cannot collide with "{@code Manifest-Version}" because
+     * the returned string does not contain "{@code -}".
+     */
+    static Name azName(int seed) {
+        StringBuffer name = new StringBuffer();
+        do {
+            name.insert(0, (char) (seed % 26 + (seed < 26 ? 'A' : 'a')));
+            seed = seed / 26 - 1;
+        } while (seed >= 0);
+        return new Name(name.toString());
+    }
+
+    /**
+     * Writes and reads a manifest with the complete unicode character set.
+     * The characters are grouped into manifest header values with about as
+     * many bytes as allowed each, utilizing a single big manifest.
+     * <p>
+     * This test assumes that a manifest is encoded and decoded correctly if
+     * writing and then reading it again results in a manifest with identical
+     * values as the original. The test is not about other aspects of writing
+     * and reading manifests than only that, given the fact and the way it
+     * works for some characters such as the most widely and often used ones,
+     * it also works for the complete unicode character set just the same.
+     * <p>
+     * Only header values are tested. The set of allowed characters for header
+     * names are much more limited and are a different topic entirely and most
+     * simple ones are used here as necessary just to get valid and different
+     * ones (see {@link #azName}).
+     * <p>
+     * Because the current implementation under test uses different portions
+     * of code depending on where the value occurs to read or write, each
+     * character is tested in each of the three positions:<ul>
+     * <li>main attribute header,</li>
+     * <li>named section name, and</li>
+     * <li>named sections header values</li>
+     * </ul>
+     * Implementation of writing the main section headers in
+     * {@link Attributes#writeMain(java.io.DataOutputStream)} differs from the
+     * one writing named section headers in
+     * {@link Attributes#write(java.io.DataOutputStream)} regarding the special
+     * order of {@link Name#MANIFEST_VERSION} and
+     * {@link Name#SIGNATURE_VERSION} and also
+     * {@link Manifest#read(java.io.InputStream)} at least potentially reads
+     * main sections differently than reading named sections names headers in
+     * {@link Attributes#read(Manifest.FastInputStream, byte[])}.
+     */
+    @Test
+    public void testUnicodeCharacterSet() throws IOException {
+        Manifest mf = new Manifest();
+        mf.getMainAttributes().put(Name.MANIFEST_VERSION, "1.0");
+
+        List<String> values = produceUnicodeCharacterValues();
+        for (int i = 0; i < values.size(); i++) {
+            Name name = azName(i);
+            String value = values.get(i);
+
+            mf.getMainAttributes().put(name, value);
+            Attributes attributes = new Attributes();
+            mf.getEntries().put(value, attributes);
+            attributes.put(name, value);
+        }
+
+        mf = writeAndRead(mf);
+
+        for (int i = 0; i < values.size(); i++) {
+            String value = values.get(i);
+            Name name = azName(i);
+
+            assertEquals(mf.getMainAttributes().getValue(name), value,
+                    "main attributes header value");
+            Attributes attributes = mf.getAttributes(value);
+            assertNotNull(attributes, "named section");
+            assertEquals(attributes.getValue(name), value,
+                    "named section attributes value");
+        }
+    }
+
+    static Manifest writeAndRead(Manifest mf) throws IOException {
+        ByteArrayOutputStream out = new ByteArrayOutputStream();
+        mf.write(out);
+        byte[] mfBytes = out.toByteArray();
+
+        String sepLine = "----------------------------------------------"
+                + "--------------------------";
+        System.out.println(sepLine);
+        System.out.print(new String(mfBytes, UTF_8));
+        System.out.println(sepLine);
+
+        ByteArrayInputStream in = new ByteArrayInputStream(mfBytes);
+        return new Manifest(in);
+    }
+
+}

Re: Review Request: 6202130: java.util.jar.Attributes.writeMain() can't handle multi-byte chars

Reply via email to