Re: RFR for 6202130: java.util.jar.Attributes.writeMain() can't handle multi-byte chars

Philipp Kunz Fri, 04 May 2018 00:51:44 -0700

Hi Sherman,

Thank you very much for your reply. Your question led me to reconsider whether so many changes were actually necessary, and they really are not. Thanks. I prepared a smaller patch for 6202130:


http://files.paratix.ch/jdk/6202130/webrev.02/
http://files.paratix.ch/jdk/6202130/webrev.02.zip

I fully agree that utf-8 + 72-byte line breaks have worked before (http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-April/052688.html). However, a comment in the bug says: "But before we do that, we should have a test". Breaking the lines is already covered in LineBreakLineWidth. For utf support I wrote ValueUtfEncoding and with that, provided the test is good, the comments can be removed.

Breaking lines and correct utf support in combination result in the additional requirement not to break inside of multi-byte characters onto continuation lines. In my previous patch I confused that with the subject here, bug 6202130, but now I think it can be addressed separately in bug 6443578.
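
To illustrate what "not breaking inside of multi-byte characters" could mean in practice, here is a minimal sketch. It is not taken from the patch or from the JDK; the class name Utf8LineBreakSketch and the method safeBreak are hypothetical and only assume that a UTF-8 continuation byte has the bit pattern 10xxxxxx.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LineBreakSketch {

    /** A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Returns the largest break position at most {@code limit} bytes after
     * {@code start} that does not fall inside a multi-byte character.
     */
    static int safeBreak(byte[] utf8, int start, int limit) {
        int end = Math.min(start + limit, utf8.length);
        // Step back while the byte right after the break continues a character.
        while (end > start && end < utf8.length
                && isContinuationByte(utf8[end])) {
            end--;
        }
        return end;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 71; i++) {
            sb.append('a');
        }
        sb.append('\u00e9'); // two bytes in UTF-8, at offsets 71 and 72
        byte[] bytes = sb.toString().getBytes(StandardCharsets.UTF_8);
        // A break at 72 would split the two-byte character, so this prints 71.
        System.out.println(safeBreak(bytes, 0, 72));
    }
}
```

The sketch only finds the break position; how continuation lines are then written (leading space, line terminator) is left out on purpose.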


The version header, which you also pointed out,
"            if ((version != null) && !(name.equalsIgnoreCase(vername))) {"
is the subject of other bugs: 4935610, 6910466, 8196371, or 4271239.

Version headers are defined separately in the specification, and writing them is also implemented slightly differently than other headers. According to the specification, version headers can contain only digits and periods and cannot be continued on the next line. Lines must not exceed a maximum length, and therefore I would conclude that version values should have their own length limit. In the current implementation, version headers are never broken onto continuation lines, but they can exceed the maximum allowed line width, which is not correct. Could this be worth a comment in a bug?
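
As a rough sketch of the check I have in mind for version values: the class VersionHeaderSketch and the method isWritableVersion are hypothetical and not part of any patch. The pattern reflects my reading of the specification (digits separated by periods), and the limit assumes 72 bytes per line not counting the line terminator.

```java
import java.util.regex.Pattern;

public class VersionHeaderSketch {

    /** Digits separated by single periods, e.g. "1.0" (assumed grammar). */
    static final Pattern VERSION = Pattern.compile("[0-9]+(\\.[0-9]+)*");

    /** Assumed line limit in bytes, not counting the line terminator. */
    static final int MAX_LINE_BYTES = 72;

    /**
     * Returns true if the value looks like a version number and the whole
     * header fits on one line without a continuation line.
     * The header name is assumed to be plain ASCII.
     */
    static boolean isWritableVersion(String headerName, String value) {
        if (!VERSION.matcher(value).matches()) {
            return false;
        }
        // Name, ": " and value are all ASCII here, so chars equal bytes.
        return headerName.length() + 2 + value.length() <= MAX_LINE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(isWritableVersion("Manifest-Version", "1.0"));
        System.out.println(isWritableVersion("Manifest-Version", "1.0-beta"));
    }
}
```

Whether a non-conforming version should be rejected, truncated, or only reported with a warning is exactly the open question, so the sketch just returns a boolean.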

With the previous patch, I also tried to remove the use of the deprecated String(byte[], int, int, int). There are also duplicated portions of code nearby. Would it be an option, if not too crazy an idea, to create another bug for these points?
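
One possible replacement for the deprecated constructor, given only as a sketch and not as what any patch does: with a hibyte of 0, the deprecated String(byte[], int, int, int) maps every byte to the char with the same unsigned value, which is also what decoding with ISO-8859-1 does. The class and method names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ByteStringSketch {

    /** The current approach: one char per byte, high byte set to 0. */
    @SuppressWarnings("deprecation")
    static String viaDeprecatedConstructor(byte[] vb) {
        return new String(vb, 0, 0, vb.length);
    }

    /** Possible replacement: ISO-8859-1 also maps each byte to one char. */
    static String viaLatin1(byte[] vb) {
        return new String(vb, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] vb = "caf\u00e9 \u20ac".getBytes(StandardCharsets.UTF_8);
        // Both strings have one char per UTF-8 byte and are equal.
        System.out.println(
                viaDeprecatedConstructor(vb).equals(viaLatin1(vb)));
    }
}
```

This keeps the existing byte-per-char trick intact and only removes the deprecation; it does not address the duplicated code.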


Regards,
Philipp




On 03.05.2018 08:56, Xueming Shen wrote:

Philipp,
I kinda recalled JDK-6202130, which I read a long time ago. I believed it's not a bug but just wanted to leave it open for a double check, and then it fell off the priority list somehow...
Rereading the code below, I think it handles utf8 and the 72-length-limit appropriately, though it is a little tricky:

(1) value.getBytes("UTF8") + deprecated String(byte[], int, int, int),
which creates a String in which each "char" actually represents one byte value of the resulting utf8 byte sequence, with its utf16 value = [hi-byte = 0 and low-byte = "utf8-byte"]
(2) append this String to the StringBuffer (--> StringBuilder); now sb.length() actually is the length of the utf8 byte sequence, and make72Safe(...) is used to adjust the length to 72, working on "chars" inside the buffer.

(3) write out the adjusted buffer via DataOutputStream.writeBytes(), in which the "write" cuts off the hi-byte of that utf16, so you actually get the original utf-8 byte sequence output to the stream.
Sure, the implementation looks "interesting" and uses a deprecated String constructor, but it was written 2 decades ago.
Back then I think we were thinking maybe the fix is to simply remove the "misleading" comment line below in the source code, if the implementation actually supports utf-8 + 72-limit.
" * XXX Need to handle UTF8 values and break up lines longer than 72 bytes"
Any reason/test case to believe the mechanism does not work?
BTW the logic of the line below in writeMain() appears to be incorrect. I do have another bug for it.
" if ((version != null) && !(name.equalsIgnoreCase(vername))) {"
----------------------

                StringBuffer buffer = new StringBuffer(name);
                buffer.append(": ");

                String value = (String) e.getValue();
                if (value != null) {
                    byte[] vb = value.getBytes("UTF8");
                    value = new String(vb, 0, 0, vb.length);
                }
                buffer.append(value);

                Manifest.make72Safe(buffer);
                buffer.append("\r\n");
                out.writeBytes(buffer.toString());

-------------------------
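
For anyone who wants to try steps (1) and (3) in isolation, here is a small self-contained sketch. It is not the JDK code; the class WriteBytesRoundTripSketch and its method are made up for illustration, and the make72Safe step (2) is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WriteBytesRoundTripSketch {

    /** Mimics steps (1) and (3) above and returns the bytes written. */
    @SuppressWarnings("deprecation")
    static byte[] writeLikeAttributes(String value) {
        // (1) one char per utf8 byte, hi-byte = 0
        byte[] vb = value.getBytes(StandardCharsets.UTF_8);
        String oneCharPerByte = new String(vb, 0, 0, vb.length);

        // (3) writeBytes() writes only the low byte of every char
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBytes(oneCharPerByte);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String value = "\u00e9\u20ac"; // a two-byte and a three-byte character
        byte[] expected = value.getBytes(StandardCharsets.UTF_8);
        // The written bytes are the original utf8 sequence again.
        System.out.println(
                Arrays.equals(writeLikeAttributes(value), expected));
    }
}
```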

sherman



On 5/1/18, 10:21 PM, Philipp Kunz wrote:
Hi,
Recently, I tried to fix only bug 6202130, intending to fix bug 6443578 later and to get some opportunity for feedback in between, but I haven't got any. I now propose a fix for both together, which in my opinion makes more sense.
See attached patch.

Some considerations, assumptions, and explanations

 * In my opinion, the code for writing manifests was distributed in the
   two classes Attributes and Manifest in an elegant way but somewhat
   difficult to explain the coherence. I chose to group the code that
   writes manifests into a new class ManifestWriter. The main incentive
   for that was to prevent or reduce duplicated code I would have had
   to change twice otherwise. This also results in a source file of a
   suitable size.
 * I could not support the assumption that the write and writeMain
   methods in Attributes couldn't be referenced anywhere so I
   deprecated them rather than having them removed.
 * I assumed the patch will not make it into JDK 10 and, hence, the
   deprecated annotations are attributed with since = 11.
 * I could not figure out any reason for the use of DataOutputStream
   and did not use it.
 * Performance-wise I assume that the code is approximately comparable
   to the previous version. The biggest improvement in this respect I
   hope comes from removing the String that contains the byte array
   constructed with deprecated String(byte[], int, int, int) and then
   copying it over again to a StringBuffer and from there to a String
   again and then Characters. On the other hand, keeping whole
   characters together when breaking lines might make it slightly
   slower. I hope my changes are an overall improvement, but I haven't
   measured it.
 * For telling first from continuation bytes of utf-8 characters apart
   I re-used a method isNotUtfContinuationByte from either StringCoding
   or UTF_8.Decoder. Unfortunately I found no way not to duplicate it.
 * Where it said before "XXX Need to handle UTF8 values and break up
   lines longer than 72 bytes" in Attributes#writeMain I did not dare
   to remove the comment completely because it still does not deal
   correctly with version headers longer than 72 bytes and the set of
   allowed values. I changed it accordingly. Two similar comments are
   removed in the patch.
 * I added two tests, WriteDeprecated and NullKeysAndValues, to
   demonstrate compatibility as well as I could. They might, however,
   not be worth keeping and maintaining.
 * LineBrokenMultiByteCharacter for jarsigner should not be removed or
   not so immediately because someone might attempt to sign an older
   jarfile created without that patch with a newer jarsigner that
   already contains it.
Suggested changes or additions to the bug database (I have no permission to edit it myself):
 * Re-combine copies of isNotUtfContinuationByte (three by now).
   Relates to 6184334. Worth to file another issue?
 * Manifest versions have specific specifications, cannot break across
   lines and can contain a subset of characters only. Bug 6910466
   relates but is not exactly the same. If someone else is convinced
   that writing a manifest should issue a warning or any other way to
   deal with a version that does not conform to the specification, I'd
   suggest to create a separate bug for that.
Now, I would be glad if someone sponsored a review. This is only my third attempt to submit a patch, which is why I chose a less important subject to fix in order to get familiar, and now I understand it's not the most attractive patch to review. Please don't hesitate to suggest what I could do better or differently.
As a bonus, with these changes, manifest files will always be displayed correctly with just any utf capable viewer, even if they contain multi-byte utf characters that would have been broken across a line break with the current/previous implementation, and all manifests will also become valid strings in Java.

diff -r 2ace90aec488 src/java.base/share/classes/java/util/jar/Attributes.java
--- a/src/java.base/share/classes/java/util/jar/Attributes.java	Mon Apr 30 21:56:54 2018 -0400
+++ b/src/java.base/share/classes/java/util/jar/Attributes.java	Fri May 04 09:45:10 2018 +0200
@@ -296,7 +296,6 @@
 
     /*
      * Writes the current attributes to the specified data output stream.
-     * XXX Need to handle UTF8 values and break up lines longer than 72 bytes
      */
      @SuppressWarnings("deprecation")
      void write(DataOutputStream os) throws IOException {
@@ -323,8 +322,6 @@
      * Writes the current attributes to the specified data output stream,
      * make sure to write out the MANIFEST_VERSION or SIGNATURE_VERSION
      * attributes first.
-     *
-     * XXX Need to handle UTF8 values and break up lines longer than 72 bytes
      */
     @SuppressWarnings("deprecation")
     void writeMain(DataOutputStream out) throws IOException
@@ -367,7 +364,6 @@
 
     /*
      * Reads attributes from the specified input stream.
-     * XXX Need to handle UTF8 values.
      */
     @SuppressWarnings("deprecation")
     void read(Manifest.FastInputStream is, byte[] lbuf) throws IOException {
diff -r 2ace90aec488 test/jdk/java/util/jar/Manifest/ValueUtfEncoding.java
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test/jdk/java/util/jar/Manifest/ValueUtfEncoding.java	Fri May 04 09:45:10 2018 +0200
@@ -0,0 +1,223 @@
+/*
+ * Copyright (c) 2018, Oracle and/or its affiliates. All rights reserved.
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This code is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 only, as
+ * published by the Free Software Foundation.
+ *
+ * This code is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ * version 2 for more details (a copy is included in the LICENSE file that
+ * accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License version
+ * 2 along with this work; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
+ * or visit www.oracle.com if you need additional information or have any
+ * questions.
+ */
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.jar.Attributes;
+import java.util.jar.Attributes.Name;
+import java.util.jar.Manifest;
+import java.util.List;
+import java.util.ArrayList;
+
+import org.testng.annotations.Test;
+import static org.testng.Assert.*;
+
+/**
+ * @test
+ * @bug 6202130
+ * @run testng ValueUtfEncoding
+ * @summary Tests complete manifest values utf encoding
+ * <p>
+ * This test writes and reads a manifest that contains every valid utf
+ * character (three times), grouped into manifest header values with about
+ * 65535 bytes each or slightly more, resulting in a single huge manifest with
+ * 3 * 67 + 1 values and 13703968 bytes in the manifest's encoded form in
+ * total. This way, all possible 1111995 utf characters are covered in one
+ * manifest.
+ * <p>
+ * Every character occurs three times, once in a main attribute value, once in
+ * a section name, and once in a named section attribute value, because
+ * implementation of writing the main section headers differs from the one
+ * writing named section headers in
+ * {@link Attributes#writeMain(java.io.DataOutputStream)} and
+ * {@link Attributes#write(java.io.DataOutputStream)} due to special order of
+ * {@link Name#MANIFEST_VERSION} and {@link Name#SIGNATURE_VERSION},
+ * and also due to {@link Manifest#read(java.io.InputStream)} treating reading
+ * the main section differently from reading named section names and headers.
+ * <p>
+ * Only header values are tested. Characters for header names are much more
+ * limited and very simple ones are used just to get valid and different ones.
+ */
+public class ValueUtfEncoding {
+
+    /**
+     * From the specifications:
+     * <q>Implementations should support 65535-byte (not character) header
+     * values, and 65535 headers per file. They might run out of memory,
+     * but there should not be hard-coded limits below these values.</q>
+     *
+     * @see <a href="{@docRoot}/../specs/jar/jar.html#Notes_on_Manifest_and_Signature_Files">Notes on Manifest and Signature Files</a>
+     */
+    static final int MIN_VALUE_LENGTH_SUPPORTED = (1 << 16) - 1;
+
+    static final int MAX_UTF_CHARACTER_ENCODED_LENGTH = 4;
+
+    static boolean isValidUtfCharacter(int codePoint) {
+        if (0xFDD0 <= codePoint && codePoint <= 0xFDEF) {
+            return false; /* non-characters */
+        }
+        if ((codePoint & 0xFFFE) == 0xFFFE) {
+            return false; /* non-characters U+nFFFE and U+nFFFF */
+        }
+        return true;
+    }
+
+    /**
+     * returns {@code true} if {@code codePoint} is explicitly forbidden in
+     * manifest values based on a statement from the specs:
+     * <pre>otherchar: any UTF-8 character except NUL, CR and LF</pre>
+     *
+     * @see <a href="{@docRoot}/../specs/jar/jar.html#Section-Specification">Jar File Specification</a>
+     */
+    static boolean isInvalidManifestValueCharacter(int codePoint) {
+        return codePoint == 0 /* NUL */
+            || codePoint == '\r' /* CR */
+            || codePoint == '\n' /* LF */;
+    }
+
+    /**
+     * Produces a list of strings with all known utf characters except those
+     * invalid in manifest header values with at least
+     * {@link #MIN_VALUE_LENGTH_SUPPORTED} utf-8 encoded bytes each
+     * except the last string which contains just the remaining characters.
+     */
+    static List<String> produceValidManifestUtfCharacterValues() {
+        int maxLengthBytes = MIN_VALUE_LENGTH_SUPPORTED +
+                // exceed the specified limit by at least one character
+                MAX_UTF_CHARACTER_ENCODED_LENGTH + 1;
+
+        int numberOfUsedCodePoints = 0;
+        ArrayList<String> values = new ArrayList<>();
+        byte[] valueBuf = new byte[maxLengthBytes];
+        int pos = 0;
+        for (int codePoint = Character.MIN_CODE_POINT;
+                codePoint <= Character.MAX_CODE_POINT; codePoint++) {
+            if (!isValidUtfCharacter(codePoint)) {
+                continue;
+            }
+            if (isInvalidManifestValueCharacter(codePoint)) {
+                continue;
+            }
+            numberOfUsedCodePoints++;
+
+            byte[] charBuf =
+                    new String(Character.toChars(codePoint)).getBytes(UTF_8);
+            if (pos + charBuf.length > valueBuf.length) {
+                values.add(new String(valueBuf, 0, pos, UTF_8));
+                pos = 0;
+            }
+            System.arraycopy(charBuf, 0, valueBuf, pos, charBuf.length);
+            pos += charBuf.length;
+        }
+        if (pos > 0) {
+            values.add(new String(valueBuf, 0, pos, UTF_8));
+        }
+
+        if (numberOfUsedCodePoints !=
+                (17 << 16) /* utf space */
+                - 66 /* non-characters */
+                - 3 /* nul, cr, lf */) {
+            fail("self-test: utf character set not covered exactly");
+        }
+
+        return values;
+    }
+
+    /**
+     * returns simple, valid, short, and distinct manifest header names.
+     * The returned name cannot be "{@code Manifest-Version}" because the
+     * returned string does not contain "{@code -}".
+     *
+     * @param seed seed to produce distinct names
+     */
+    static String azName(int seed) {
+        StringBuffer name = new StringBuffer();
+        do {
+            name.insert(0, (char) (seed % 26 + (seed < 26 ? 'A' : 'a')));
+            seed = seed / 26 - 1;
+        } while (seed >= 0);
+        return name.toString();
+    }
+
+    /**
+     * covers writing and reading of manifests with all known utf characters.
+     *
+     * Because the implementation used different portions of code depending on
+     * where the value occurs to read or write in earlier versions, each
+     * character is tested in each of the three positions:<ul>
+     * <li>main attribute header,</li>
+     * <li>named section name, which is in fact a header value after a blank
+     * line, and</li>
+     * <li>named sections header values</li>
+     * </ul>
+     */
+    @Test
+    public void testValueUtfEncoding() throws IOException {
+        Manifest mf = new Manifest();
+        mf.getMainAttributes().put(Name.MANIFEST_VERSION, "1.0");
+
+        List<String> values = produceValidManifestUtfCharacterValues();
+        for (int i = 0; i < values.size(); i++) {
+            String name = azName(i);
+            String value = values.get(i);
+
+            mf.getMainAttributes().put(new Name(name), value);
+            Attributes attributes = new Attributes();
+            mf.getEntries().put(value, attributes);
+            attributes.put(new Name(name), value);
+        }
+
+        mf = writeAndRead(mf);
+
+        for (int i = 0; i < values.size(); i++) {
+            String value = values.get(i);
+            String name = azName(i);
+
+            assertEquals(mf.getMainAttributes().getValue(name), value,
+                    "main attributes header value");
+            Attributes attributes = mf.getAttributes(value);
+            assertNotNull(attributes, "named section not found");
+            assertEquals(attributes.getValue(name), value,
+                    "named section attributes value");
+        }
+    }
+
+    static Manifest writeAndRead(Manifest mf) throws IOException {
+        ByteArrayOutputStream out = new ByteArrayOutputStream();
+        mf.write(out);
+        byte[] mfBytes = out.toByteArray();
+        
+        System.out.println("-------------------------------------------"
+                + "-----------------------------");
+        System.out.print(new String(mfBytes, UTF_8));
+        System.out.println("-------------------------------------------"
+                + "-----------------------------");
+
+        ByteArrayInputStream in = new ByteArrayInputStream(mfBytes);
+        return new Manifest(in);
+    }
+
+}
