Here is the webrev for Phillip’s proposed change for the splash image
http://cr.openjdk.java.net/~lancea/6202130/splash/ <http://cr.openjdk.java.net/~lancea/6202130/splash/> > On Feb 10, 2020, at 4:22 PM, Lance Andersen <lance.ander...@oracle.com> wrote: > > Hi all, > > Here is a webrev for the patch that Philipp is proposing which will make it > easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00 > <http://cr.openjdk.java.net/~lancea/6202130/webrev.00> > >> On Dec 26, 2019, at 11:50 AM, Philipp Kunz <philipp.k...@paratix.ch> wrote: >> >> Hi, >> The specification says, a line break in a manifest can occur beforeor >> after a Unicode character encoded in UTF-8. >>> ...> value: SPACE *otherchar newline >> *continuation> continuation: SPACE *otherchar >> newline> ...> otherchar: any UTF-8 character except NUL, CR >> and LF >> The current implementation breaks manifest lines at 72 bytes regardless >> ofhow the bytes around the break are part of a sequence of bytes >> encoding acharacter. Code points may use up to four bytes when encoded >> in UTF-8.Manifests with line breaks inside of sequences of bytes >> encoding Unicodecharacters in UTF-8 with more than one bytes not only >> are invalid UTF-8but also look ugly in text editors.For example, a >> manifest could look like this: >> import java.util.jar.Manifest;import java.util.jar.Attributes;import >> static java.util.jar.Attributes.Name; >> public class CharacterBrokenDemo1 { public static void main(String[] >> args) throws Exception{ Manifest mf = new >> Manifest(); Attributes attrs = >> mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, >> "1.0"); attrs.put(new Name("Some-Key"), "Some >> languages have decorated characters, " + "for >> example: español"); // or >> "espa\u00D1ol" mf.write(System.out); }} >> Above code produces a result as follows with some unexpected question >> markswhere the encoding is invalid: >>> Manifest-Version: 1.0> Some-Key: Some languages have decorated >> characters, for example: espa?> ?ol >> This is of course an example written with actual question marks to get >> a validtext for this message. The trick here is that "Some-Key" to >> "example :espa"amounts to exactly one byte less encoded in UTF-8 than >> would fit on one linewith the 72 byte limit so that the subsequent >> character encoded with two bytesgets broken inside of the sequence of >> two bytes for this character across acontinuation line break. >> When decoding the resulting bytes from UTF-8 as one whole string, the >> twoquestion marks will not fit together again even if the line break >> with thecontinuation space is removed. However, Manifest::read removes >> the continuationline breaks ("\r\n ") before decoding the manifest >> header value from UTF-8 andhence can reproduce the original value. >> Characters encoded in UTF-8 can not only span up to four bytes for one >> codepoint each, there are also combining characters or classes thereof >> or combiningdiacritical marks or whatever the appropriate term could >> be, that combine morethan one code point into what is usually >> experienced and referred to as acharacter. >> The term character really gets ambiguous at this point. I wouldn't know >> whatthe specification actually refers to with that term "character". So >> rather thandiving in too much specification or any sorts of theory, >> let's look at anotherexample: >> import java.util.jar.Manifest;import java.util.jar.Attributes;import >> static java.util.jar.Attributes.Name; >> public class DemoCharacterBroken2 { public static void main(String[] >> args) throws Exception{ Manifest mf = new >> Manifest(); Attributes attrs = >> mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, >> "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + >> "Angstro\u0308m"); mf.write(System.out); }} >> which produces console output as follows: >>> Manifest-Version: 1.0> Some- >> Key: Angstro> >> ̈m >> (In case this does not display well, the diaeresis is on the m on the >> last line) >> When the whole Manifest is decoded from UTF-8 as one big single string >> andcontinuation line breaks are not removed until after UTF-8 decoding >> the wholemanifest, the diaeresis (umlaut, two points above, u0308) >> apparently kind ofjumps onto the following letter m because somehow it >> cannot be combined withthe preceding space. The UTF-8 decoder (all of >> my editors I tried, not onlyEclipse and its console view, also less, >> gedit, cat and terminal) somehowtries to fix that but the diaeresis may >> not necessarily jump back on the "o"where it originally belonged by >> removing the continuation line break ("\r\n ")after UTF-8 decoding has >> taken place, at least that did not work for me. >> Hence, ideally combining diacritical marks should better not be >> separated fromwhatever they combine with when breaking manifest lines >> onto a continuationline. Such combinations, however, seem to be >> unlimited in terms of number ofcode points combining into the same >> "experienced" character. I was able tofind combinations that not only >> exceed the limit of 72 bytes per line but alsoexceed the line buffer >> size of 512 bytes in Manifest::read. These may be ratheruncommon but >> still possible to my own surprise. >> Next consideration would then be to remove that limit of 512 bytes per >> manifestline but exceeding it would make such manifests incompatible >> with previousManifest::read implementations and is not really an >> immediately availableoption at the moment. >> As a compromise, those characters including combining diacritical marks >> whichcombine only so many code points as that their binarily encoded >> form in UTF-8remains within a limit of 71 bytes can be written without >> an interruptingcontinuation line break, which applies to most cases, >> but not all. I guess thisshould suit practically and realistically to >> be expected values well. >> Another possibility would be to allow for characters that are >> combinations ofmultiple Unicode code points to be kept together in >> their encoded form in UTF-8up to 512 bytes line length limit when >> reading minus a space and a line breakamounting to 509 bytes, but that >> would still not make manifests be representedas valid Unicode in all >> corner cases and I guess would not probably make a realimprovement in >> practice over a limit of 71 bytes. >> Attached is a patch that tries to implement what was described above >> using aBreakIterator. While it works from a functional point of view, >> this might beless desirable performance-wise. Alternatively could be >> considered to do withoutthe BreakIterator and only keep Unicode code >> points together by not placingline breaks before a continuation byte, >> which however would not addresscombining diacritical marks as in the >> second example above. >> The jar file specification does not explicitly state that manifest >> should bevalid UTF-8, and they were not always, but it also does not >> state otherwise,leaving an impression that manifests could be >> (mis)taken for UTF-8 encodedstrings, which they also are in many or >> most cases and which has been confusedmany times. At the moment, the >> only case where a valid manifest is not also avalid UTF-8 encoded >> string is when a sequence of bytes encoding the samecharacter happens >> to be interrupted with a continuation line break. To the bestof my >> knowledge, all other valid manifests are also valid UTF-8 encoded >> strings. >> It would be nice, if manifests could be viewed and manipulated with all >> Unicodecapable editors and not only be parsed correctly with >> Manifest::read. >> Any opinions? Would someone sponsor this patch? >> Regards,Philipp >> >> https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificationhttps://bugs.openjdk.java.net/browse/JDK-6202130https://bugs.openjdk.java.net/browse/JDK-6443578https://github.com/gradle/gradle/issues/5225https://bugs.openjdk.java.net/browse/JDK-8202525https://en.wikipedia.org/wiki/Combining_character >> >> <6202130-manifestutf8linebreak.patch> > > <http://oracle.com/us/design/oracle-email-sig-198324.gif> > <http://oracle.com/us/design/oracle-email-sig-198324.gif> > <http://oracle.com/us/design/oracle-email-sig-198324.gif> > <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| > Principal Member of Technical Staff | +1.781.442.2037 > Oracle Java Engineering > 1 Network Drive > Burlington, MA 01803 > lance.ander...@oracle.com <mailto:lance.ander...@oracle.com> > > > <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 lance.ander...@oracle.com <mailto:lance.ander...@oracle.com>