Hi all, Here is a webrev for the patch that Philipp is proposing which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00 <http://cr.openjdk.java.net/~lancea/6202130/webrev.00>
> On Dec 26, 2019, at 11:50 AM, Philipp Kunz <philipp.k...@paratix.ch> wrote: > > Hi, > The specification says, a line break in a manifest can occur beforeor > after a Unicode character encoded in UTF-8. >> ...> value: SPACE *otherchar newline > *continuation> continuation: SPACE *otherchar > newline> ...> otherchar: any UTF-8 character except NUL, CR > and LF > The current implementation breaks manifest lines at 72 bytes regardless > ofhow the bytes around the break are part of a sequence of bytes > encoding acharacter. Code points may use up to four bytes when encoded > in UTF-8.Manifests with line breaks inside of sequences of bytes > encoding Unicodecharacters in UTF-8 with more than one bytes not only > are invalid UTF-8but also look ugly in text editors.For example, a > manifest could look like this: > import java.util.jar.Manifest;import java.util.jar.Attributes;import > static java.util.jar.Attributes.Name; > public class CharacterBrokenDemo1 { public static void main(String[] > args) throws Exception{ Manifest mf = new > Manifest(); Attributes attrs = > mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, > "1.0"); attrs.put(new Name("Some-Key"), "Some > languages have decorated characters, " + "for > example: español"); // or > "espa\u00D1ol" mf.write(System.out); }} > Above code produces a result as follows with some unexpected question > markswhere the encoding is invalid: >> Manifest-Version: 1.0> Some-Key: Some languages have decorated > characters, for example: espa?> ?ol > This is of course an example written with actual question marks to get > a validtext for this message. The trick here is that "Some-Key" to > "example :espa"amounts to exactly one byte less encoded in UTF-8 than > would fit on one linewith the 72 byte limit so that the subsequent > character encoded with two bytesgets broken inside of the sequence of > two bytes for this character across acontinuation line break. > When decoding the resulting bytes from UTF-8 as one whole string, the > twoquestion marks will not fit together again even if the line break > with thecontinuation space is removed. However, Manifest::read removes > the continuationline breaks ("\r\n ") before decoding the manifest > header value from UTF-8 andhence can reproduce the original value. > Characters encoded in UTF-8 can not only span up to four bytes for one > codepoint each, there are also combining characters or classes thereof > or combiningdiacritical marks or whatever the appropriate term could > be, that combine morethan one code point into what is usually > experienced and referred to as acharacter. > The term character really gets ambiguous at this point. I wouldn't know > whatthe specification actually refers to with that term "character". So > rather thandiving in too much specification or any sorts of theory, > let's look at anotherexample: > import java.util.jar.Manifest;import java.util.jar.Attributes;import > static java.util.jar.Attributes.Name; > public class DemoCharacterBroken2 { public static void main(String[] > args) throws Exception{ Manifest mf = new > Manifest(); Attributes attrs = > mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, > "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + > "Angstro\u0308m"); mf.write(System.out); }} > which produces console output as follows: >> Manifest-Version: 1.0> Some- > Key: Angstro> > ̈m > (In case this does not display well, the diaeresis is on the m on the > last line) > When the whole Manifest is decoded from UTF-8 as one big single string > andcontinuation line breaks are not removed until after UTF-8 decoding > the wholemanifest, the diaeresis (umlaut, two points above, u0308) > apparently kind ofjumps onto the following letter m because somehow it > cannot be combined withthe preceding space. The UTF-8 decoder (all of > my editors I tried, not onlyEclipse and its console view, also less, > gedit, cat and terminal) somehowtries to fix that but the diaeresis may > not necessarily jump back on the "o"where it originally belonged by > removing the continuation line break ("\r\n ")after UTF-8 decoding has > taken place, at least that did not work for me. > Hence, ideally combining diacritical marks should better not be > separated fromwhatever they combine with when breaking manifest lines > onto a continuationline. Such combinations, however, seem to be > unlimited in terms of number ofcode points combining into the same > "experienced" character. I was able tofind combinations that not only > exceed the limit of 72 bytes per line but alsoexceed the line buffer > size of 512 bytes in Manifest::read. These may be ratheruncommon but > still possible to my own surprise. > Next consideration would then be to remove that limit of 512 bytes per > manifestline but exceeding it would make such manifests incompatible > with previousManifest::read implementations and is not really an > immediately availableoption at the moment. > As a compromise, those characters including combining diacritical marks > whichcombine only so many code points as that their binarily encoded > form in UTF-8remains within a limit of 71 bytes can be written without > an interruptingcontinuation line break, which applies to most cases, > but not all. I guess thisshould suit practically and realistically to > be expected values well. > Another possibility would be to allow for characters that are > combinations ofmultiple Unicode code points to be kept together in > their encoded form in UTF-8up to 512 bytes line length limit when > reading minus a space and a line breakamounting to 509 bytes, but that > would still not make manifests be representedas valid Unicode in all > corner cases and I guess would not probably make a realimprovement in > practice over a limit of 71 bytes. > Attached is a patch that tries to implement what was described above > using aBreakIterator. While it works from a functional point of view, > this might beless desirable performance-wise. Alternatively could be > considered to do withoutthe BreakIterator and only keep Unicode code > points together by not placingline breaks before a continuation byte, > which however would not addresscombining diacritical marks as in the > second example above. > The jar file specification does not explicitly state that manifest > should bevalid UTF-8, and they were not always, but it also does not > state otherwise,leaving an impression that manifests could be > (mis)taken for UTF-8 encodedstrings, which they also are in many or > most cases and which has been confusedmany times. At the moment, the > only case where a valid manifest is not also avalid UTF-8 encoded > string is when a sequence of bytes encoding the samecharacter happens > to be interrupted with a continuation line break. To the bestof my > knowledge, all other valid manifests are also valid UTF-8 encoded > strings. > It would be nice, if manifests could be viewed and manipulated with all > Unicodecapable editors and not only be parsed correctly with > Manifest::read. > Any opinions? Would someone sponsor this patch? > Regards,Philipp > > https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificationhttps://bugs.openjdk.java.net/browse/JDK-6202130https://bugs.openjdk.java.net/browse/JDK-6443578https://github.com/gradle/gradle/issues/5225https://bugs.openjdk.java.net/browse/JDK-8202525https://en.wikipedia.org/wiki/Combining_character > > <6202130-manifestutf8linebreak.patch> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 lance.ander...@oracle.com <mailto:lance.ander...@oracle.com>