Re: 6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Philipp Kunz Mon, 10 Feb 2020 22:30:19 -0800

Hi Lance,
Thank you for the webrev.

In my opinion the two patches would only be related if we wanted to
test breaking multi-byte encoded characters across line breaks within
the splash screen image manifest header. I found this is currently
working and propose not to cover it with a specific test case.
Considering the possibility that jar files created after having applied
the 6202130 manifest line break patch manifest attributes would not
contain line broken characters any more would make such a test case
pointless and the possibility that handling the splash image might be
moved from C into Java like it has happened to the main class attribute
would make parsing the splash screen image attribute rely on
java.util.jar stuff which in turn is known to work well with UTF-8 and
also being covered with tests. These are not yet facts but additionally
reinforce my proposal not to test splash screen image attributes with
multi-byte characters broken across a line break for now.
In any case I share your opinion to discuss the patches first and then
revisit whether or not or how they relate. May I ask already now what
specific kind of relation between them you see or think of or what I
might have missed?


As for creating the jar file on the fly as I understand your comment
suggests, it would be feasible for splash screen images but not
necessarily for the classes inside of that jar file the file names of
which correspond to class names containing Unicode characters. From my
understanding of the previous version of UnicodeTest I figure that
there are platforms the file system and command lines of which don't
support Unicode. More precisely, I figure that UnicodeTest.runJarTests
runs "runTest(UnicodeTestJar);" which uses the current supported
encoding of the platform running the test for the jar's contents and
"runTest(SolarisUnicodeTestJar);" which always uses full UTF-8 encoding
for that jar's contents independent of the platform running the test.
The difference might not appear too obvious at first glance.While the
additional splash screen image part of that test could add a splash
screen image header to the manifest and an image file to the jar on the
fly, I came to support the previously chosen approach for the main
class attribute. I also don't like binary files such as the
UnicodeTest.jar file checked in to the source code but don't see an
easy and still convincingly reliable way to compile classes the file
names of which are not supported by the platform running the test. The
UnicodeTest.jar cannot be built on the fly the way it used to be on all
platforms but still the main class attribute and the splash screen
image header work with a pre-built jar when all involved parts are
inside the jar file so that neither the host file system nor the
command line ever sees any of those on all platforms. That works with
class names for the main class attribute as well as the splash screen
image.
Regards,Philipp



On Mon, 2020-02-10 at 14:45 -0500, Lance Andersen wrote:
> Hi Phillip,
> Thank you for the update patch(s)
> 
> 
> I created a webrev for 6202130 and can be found here: 
> http://cr.openjdk.java.net/~lancea/6202130/webrev.00
> 
> I think this needs to be addressed prior to discussions on the other.
>  It would be best for your 2nd patch to create the jar via the test.
>  You could save a byte array containing the contents of the jar and
> have the test write them out  as part of the set up
> 
> Best
> Lance
> 
> > On Feb 10, 2020, at 2:15 PM, Philipp Kunz <philipp.k...@paratix.ch>
> > wrote:
> > Hi Lance,
> > Thank you for your answer. The patch was indeed outdated and I
> > could reproduce your problem with it. There were minor merge
> > conflicts which is why the new patches have slightly different
> > contents than the previous ones. Attached are two patches that
> > should work better now. Sorry for the inefficient first round.
> > An additional difficulty comes with the patch for the splash screen
> > image. It is meant to change UnicodeTest.jar which is not included
> > in the attached patch. I produced the new version with the
> > following commands after having applied the patch:
> > ../jtreg/bin/jtreg -va -nr -jdk:build/linux-x86_64-server-
> > release/images/jdk/ -Xshare:off
> > test/jdk/tools/launcher/UnicodeTest.javacp
> > JTwork/scratch/UnicodeTest.jar test/jdk/tools/launcher/
> > I figure it is important to build it with the real reference
> > configuration presumably on Solaris like the previous version as
> > the test comments and then add that jar with the patch.
> > Regards,Philipp
> > 
> > 
> > On Sat, 2020-02-08 at 10:06 -0500, Lance Andersen wrote:
> > > Hi Philipp,
> > > I tried to apply your patch via hg import --no-commit but it
> > > fails for me
> > > 
> > > Can you please forward the file via email and I will apply them
> > > and generate a webrev
> > > 
> > > Thank you
> > > 
> > > > On Dec 26, 2019, at 11:50 AM, Philipp Kunz <
> > > > philipp.k...@paratix.ch> wrote:
> > > > Hi,
> > > > The specification says, a line break in a manifest can occur
> > > > beforeor
> > > > after a Unicode character encoded in UTF-8.
> > > > >    ...>      value:         SPACE *otherchar newline
> > > > *continuation>      continuation:  SPACE *otherchar
> > > > newline>    ...>      otherchar:     any UTF-8 character except
> > > > NUL, CR
> > > > and LF
> > > > The current implementation breaks manifest lines at 72 bytes
> > > > regardless
> > > > ofhow the bytes around the break are part of a sequence of
> > > > bytes
> > > > encoding acharacter. Code points may use up to four bytes when
> > > > encoded
> > > > in UTF-8.Manifests with line breaks inside of sequences of
> > > > bytes
> > > > encoding Unicodecharacters in UTF-8 with more than one bytes
> > > > not only
> > > > are invalid UTF-8but also look ugly in text editors.For
> > > > example, a
> > > > manifest could look like this:
> > > > import java.util.jar.Manifest;import
> > > > java.util.jar.Attributes;import
> > > > static java.util.jar.Attributes.Name;
> > > > public class CharacterBrokenDemo1 {    public static void
> > > > main(String[]
> > > > args) throws Exception{        Manifest mf = new
> > > > Manifest();        Attributes attrs =
> > > > mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
> > > > "1.0");        attrs.put(new Name("Some-Key"),
> > > >                  "Some
> > > > languages have decorated characters, " +                   "for
> > > > example: español"); // or
> > > > "espa\u00D1ol"        mf.write(System.out);    }}
> > > > Above code produces a result as follows with some unexpected
> > > > question
> > > > markswhere the encoding is invalid:
> > > > >    Manifest-Version: 1.0>    Some-Key: Some languages have
> > > > > decorated
> > > > characters, for example: espa?>     ?ol
> > > > This is of course an example written with actual question marks
> > > > to get
> > > > a validtext for this message. The trick here is that "Some-Key" 
> > > > to
> > > > "example :espa"amounts to exactly one byte less encoded in UTF-
> > > > 8 than
> > > > would fit on one linewith the 72 byte limit so that the
> > > > subsequent
> > > > character encoded with two bytesgets broken inside of the
> > > > sequence of
> > > > two bytes for this character across acontinuation line break.
> > > > When decoding the resulting bytes from UTF-8 as one whole
> > > > string, the
> > > > twoquestion marks will not fit together again even if the line
> > > > break
> > > > with thecontinuation space is removed. However, Manifest::read
> > > > removes
> > > > the continuationline breaks ("\r\n ") before decoding the
> > > > manifest
> > > > header value from UTF-8 andhence can reproduce the original
> > > > value.
> > > > Characters encoded in UTF-8 can not only span up to four bytes
> > > > for one
> > > > codepoint each, there are also combining characters or classes
> > > > thereof
> > > > or combiningdiacritical marks or whatever the appropriate term
> > > > could
> > > > be, that combine morethan one code point into what is usually
> > > > experienced and referred to as acharacter.
> > > > The term character really gets ambiguous at this point. I
> > > > wouldn't know
> > > > whatthe specification actually refers to with that term
> > > > "character". So
> > > > rather thandiving in too much specification or any sorts of
> > > > theory,
> > > > let's look at anotherexample:
> > > > import java.util.jar.Manifest;import
> > > > java.util.jar.Attributes;import
> > > > static java.util.jar.Attributes.Name;
> > > > public class DemoCharacterBroken2 {    public static void
> > > > main(String[]
> > > > args) throws Exception{        Manifest mf = new
> > > > Manifest();        Attributes attrs =
> > > > mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
> > > > "1.0");        attrs.put(new Name("Some-Key"), " ".repeat(53) +
> > > > "Angstro\u0308m");        mf.write(System.out);    }}
> > > > which produces console output as follows:
> > > > >    Manifest-Version: 1.0>    Some-
> > > > Key:
> > > >                                                      Angstro>
> > > >     
> > > > ̈m
> > > > (In case this does not display well, the diaeresis is on the m
> > > > on the
> > > > last line)
> > > > When the whole Manifest is decoded from UTF-8 as one big single
> > > > string
> > > > andcontinuation line breaks are not removed until after UTF-8
> > > > decoding
> > > > the wholemanifest, the diaeresis (umlaut, two points above,
> > > > u0308)
> > > > apparently kind ofjumps onto the following letter m because
> > > > somehow it
> > > > cannot be combined withthe preceding space. The UTF-8 decoder
> > > > (all of
> > > > my editors I tried, not onlyEclipse and its console view, also
> > > > less,
> > > > gedit, cat and terminal) somehowtries to fix that but the
> > > > diaeresis may
> > > > not necessarily jump back on the "o"where it originally
> > > > belonged by
> > > > removing the continuation line break ("\r\n ")after UTF-8
> > > > decoding has
> > > > taken place, at least that did not work for me.
> > > > Hence, ideally combining diacritical marks should better not be
> > > > separated fromwhatever they combine with when breaking manifest
> > > > lines
> > > > onto a continuationline. Such combinations, however, seem to be
> > > > unlimited in terms of number ofcode points combining into the
> > > > same
> > > > "experienced" character. I was able tofind combinations that
> > > > not only
> > > > exceed the limit of 72 bytes per line but alsoexceed the line
> > > > buffer
> > > > size of 512 bytes in Manifest::read. These may be
> > > > ratheruncommon but
> > > > still possible to my own surprise.
> > > > Next consideration would then be to remove that limit of 512
> > > > bytes per
> > > > manifestline but exceeding it would make such manifests
> > > > incompatible
> > > > with previousManifest::read implementations and is not really
> > > > an
> > > > immediately availableoption at the moment.
> > > > As a compromise, those characters including combining
> > > > diacritical marks
> > > > whichcombine only so many code points as that their binarily
> > > > encoded
> > > > form in UTF-8remains within a limit of 71 bytes can be written
> > > > without
> > > > an interruptingcontinuation line break, which applies to most
> > > > cases,
> > > > but not all. I guess thisshould suit practically and
> > > > realistically to
> > > > be expected values well.
> > > > Another possibility would be to allow for characters that are
> > > > combinations ofmultiple Unicode code points to be kept together
> > > > in
> > > > their encoded form in UTF-8up to 512 bytes line length limit
> > > > when
> > > > reading minus a space and a line breakamounting to 509 bytes,
> > > > but that
> > > > would still not make manifests be representedas valid Unicode
> > > > in all
> > > > corner cases and I guess would not probably make a
> > > > realimprovement in
> > > > practice over a limit of 71 bytes.
> > > > Attached is a patch that tries to implement what was described
> > > > above
> > > > using aBreakIterator. While it works from a functional point of
> > > > view,
> > > > this might beless desirable performance-wise. Alternatively
> > > > could be
> > > > considered to do withoutthe BreakIterator and only keep Unicode
> > > > code
> > > > points together by not placingline breaks before a continuation
> > > > byte,
> > > > which however would not addresscombining diacritical marks as
> > > > in the
> > > > second example above.
> > > > The jar file specification does not explicitly state that
> > > > manifest
> > > > should bevalid UTF-8, and they were not always, but it also
> > > > does not
> > > > state otherwise,leaving an impression that manifests could be
> > > > (mis)taken for UTF-8 encodedstrings, which they also are in
> > > > many or
> > > > most cases and which has been confusedmany times. At the
> > > > moment, the
> > > > only case where a valid manifest is not also avalid UTF-8
> > > > encoded
> > > > string is when a sequence of bytes encoding the samecharacter
> > > > happens
> > > > to be interrupted with a continuation line break. To the bestof
> > > > my
> > > > knowledge, all other valid manifests are also valid UTF-8
> > > > encoded
> > > > strings.
> > > > It would be nice, if manifests could be viewed and manipulated
> > > > with all
> > > > Unicodecapable editors and not only be parsed correctly with
> > > > Manifest::read.
> > > > Any opinions? Would someone sponsor this patch?
> > > > Regards,Philipp
> > > > 
> > > > https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificationhttps://bugs.openjdk.java.net/browse/JDK-6202130https://bugs.openjdk.java.net/browse/JDK-6443578https://github.com/gradle/gradle/issues/5225https://bugs.openjdk.java.net/browse/JDK-8202525https://en.wikipedia.org/wiki/Combining_character
> > > > 
> > > > <6202130-manifestutf8linebreak.patch>
> > > 
> > > <oracle_sig_logo.gif>
> > > Lance Andersen| Principal Member of Technical Staff |
> > > +1.781.442.2037
> > > Oracle Java Engineering 
> > > 1 Network Drive 
> > > Burlington, MA 01803
> > > lance.ander...@oracle.com
> > > 
> > > 
> > > 
> > 
> > <20200210-6202130-manifestutf8linebreak.patch><20200210-
> > splashscreenimageunicodetestwithoutunicodetestjar.patch>
> 
> 
> Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037
> Oracle Java Engineering 
> 1 Network Drive 
> Burlington, MA 01803
> lance.ander...@oracle.com
> 
> 
>

Re: 6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Reply via email to