Re: RFR for 6443578 and 6202130: UTF-8 in Manifests

Roger Riggs Fri, 04 May 2018 12:53:38 -0700

Hi Philipp,

Just a reminder that OpenJDK can *only* accept patches via cr.openjdk.java.net
(as an author) or inline or attached in email.


Thanks, Roger


On 5/2/18 9:12 PM, Philipp Kunz wrote:

Hi,

Here is the patch for 6443578 and 6202130, also in webrev form.

http://files.paratix.ch/jdk/6372077and6443578/webrev.01/
http://files.paratix.ch/jdk/6372077and6443578/webrev.01.zip
Hope it helps. With all due patience, can I do anything to make it easier to get feedback or find a sponsor?
Regards,
Philipp


On 02.05.2018 07:21, Philipp Kunz wrote:
Hi,
Recently, I tried to fix only bug 6202130, intending to fix bug 6443578 later and hoping to get some opportunity for feedback, but I haven't received any. I now propose a fix for both together, which in my opinion makes more sense.
See attached patch.

Some considerations, assumptions, and explanations

  * In my opinion, the code for writing manifests was distributed
    across the two classes Attributes and Manifest in an elegant way,
    but one whose coherence is somewhat difficult to explain. I chose
    to group the code that writes manifests into a new class
    ManifestWriter. The main incentive was to prevent or reduce
    duplicated code that I would otherwise have had to change twice.
    This also results in a source file of a suitable size.
  * I could not confirm the assumption that the write and writeMain
    methods in Attributes are not referenced anywhere, so I
    deprecated them rather than removing them.
  * I assumed the patch will not make it into JDK 10 and, hence, the
    deprecated annotations are attributed with since = 11.
  * I could not figure out any reason for the use of DataOutputStream
    and did not use it.
  * Performance-wise, I assume that the code is approximately
    comparable to the previous version. The biggest improvement in
    this respect, I hope, comes from no longer building a String from
    the byte array with the deprecated String(byte[], int, int, int),
    copying it to a StringBuffer, from there to a String again, and
    then to characters. On the other hand, keeping whole characters
    together when breaking lines might make it slightly slower. I
    hope my changes are an overall improvement, but I haven't
    measured it.
  * To tell first bytes apart from continuation bytes of utf-8
    characters, I re-used a method isNotUtfContinuationByte from
    either StringCoding or UTF_8.Decoder. Unfortunately, I found no
    way to avoid duplicating it.
  * Where it said before "XXX Need to handle UTF8 values and break up
    lines longer than 72 bytes" in Attributes#writeMain, I did not
    dare to remove the comment completely because the code still
    does not deal correctly with version headers longer than 72 bytes
    or with the set of allowed values. I changed the comment
    accordingly. Two similar comments are removed in the patch.
  * I added two tests, WriteDeprecated and NullKeysAndValues, to
    demonstrate compatibility as well as I could. They might,
    however, not be worth keeping and maintaining.
  * LineBrokenMultiByteCharacter for jarsigner should not be removed,
    or at least not immediately, because someone might attempt to
    sign an older jar file created without this patch with a newer
    jarsigner that already contains it.
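
To illustrate the line-breaking idea from the list above, here is a minimal sketch and not the code from the patch. The class name ManifestLineBreakSketch and the method breakLine are hypothetical; only the name isNotUtfContinuationByte comes from the message, and its body here is an assumption based on the utf-8 byte layout (continuation bytes have the form 10xxxxxx). It assumes a limit of 72 bytes per line and a single leading space on continuation lines.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: break a manifest header line after at most 72 bytes
// without splitting a multi-byte utf-8 character across two lines.
final class ManifestLineBreakSketch {

    static final int MAX_LINE_BYTES = 72;

    // Assumed implementation: continuation bytes look like 10xxxxxx.
    static boolean isNotUtfContinuationByte(byte b) {
        return (b & 0xC0) != 0x80;
    }

    static byte[] breakLine(String line) {
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        int room = MAX_LINE_BYTES; // the first line has no leading space
        while (bytes.length - pos > room) {
            int end = pos + room;
            // bytes[end] would start the next line. If it is a continuation
            // byte, back up to the first byte of that character. getBytes
            // yields well-formed utf-8, so this backs up at most 3 bytes.
            while (!isNotUtfContinuationByte(bytes[end])) {
                end--;
            }
            out.write(bytes, pos, end - pos);
            out.write('\r');
            out.write('\n');
            out.write(' '); // continuation lines start with one space
            pos = end;
            room = MAX_LINE_BYTES - 1; // the space counts toward the limit
        }
        out.write(bytes, pos, bytes.length - pos);
        out.write('\r');
        out.write('\n');
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("Some-Header: ");
        for (int i = 0; i < 50; i++) {
            sb.append('\u00e9'); // two bytes each in utf-8
        }
        System.out.print(new String(breakLine(sb.toString()),
                StandardCharsets.UTF_8));
    }
}
```

With this approach a line can end up a few bytes shorter than 72 when a character would otherwise straddle the break, which is the small cost mentioned in the performance item.
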
Suggested changes or additions to the bug database (I have no permissions to edit it myself):
  * Re-combine copies of isNotUtfContinuationByte (three by now).
    Relates to 6184334. Worth filing another issue?
  * Manifest versions have their own specification: they cannot break
    across lines and can contain only a subset of characters. Bug
    6910466 relates but is not exactly the same. If someone else is
    convinced that writing a manifest should issue a warning, or deal
    in some other way with a version that does not conform to the
    specification, I'd suggest creating a separate bug for that.
Now, I would be glad if someone sponsored a review. This is only my third attempt to submit a patch, which is why I chose a less important subject to fix in order to get familiar, and I now understand it's not the most attractive patch to review. Please don't hesitate to suggest what I could do better or differently.
As a bonus, with these changes, manifest files will always be displayed correctly with just any utf capable viewer, even if they contain multi-byte utf characters that would have been broken across a line break with the current/previous implementation, and all manifests will also become valid strings in Java.
Regards,
Philipp



On 20.04.2018 00:58, Philipp Kunz wrote:
Hi,
I tried to fix bug 6202130 about manifest utf support and now come up with a test, as suggested in the bug's comments, that shows that the utf charset actually works before removing the comments from the code.
When I wanted to remove the XXX comments about utf, it occurred to me that version attributes ("Signature-Version" and "Manifest-Version") would never be broken across lines and should anyway not support the whole utf character set, which sounds more related to bugs 6910466 or 4935610, but it's not a real fit. Therefore, I could not remove one such comment in Attributes#writeMain, but I changed it. The first comment in bug 6202130 mentions only two comments, but there are three in Attributes. In the attached patch I removed only two of the three and changed the remaining third to no longer mention utf.
At the moment, at least until 6443578 is fixed, multi-byte utf characters can be broken across lines. It might be worth considering to test that explicitly as well, but then I guess there is not much of a point in testing the current behavior that will change with 6443578, hopefully soon. There are in my opinion enough characters broken across lines in the attached test to demonstrate that this still works like it did before.
I would have preferred also to remove the calls to the deprecated String(byte[], int, int, int), but then figured it relates more to bug 6443578 than 6202130 and now prefer to do that in another separate patch.
Bug 6202130 also states that lines are broken by String.length, not by byte length. While it looks so at first glance, I could not confirm it. The combination of getBytes("UTF8"), String(byte[], int, int, int), and then DataOutputStream.writeBytes(String) does not drop high bytes, because every byte (whether a whole character or only a part of a multi-byte character) becomes a character in String(...) containing that byte in its low byte, which will be read again by writeBytes(...). Put differently, every utf encoded byte becomes a character, and multi-byte utf characters are converted into multiple string characters containing one byte each in their lower bytes. The current solution is not nice, but at least it works. In that respect, I'd like to suggest deprecating DataOutputStream.writeBytes(String), because it does something not exactly expected when guessing from its name and a byte[] parameter would suit it better, very much like what has been done with String(byte[], int, int, int). Any advice about the procedure to deprecate something?
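
The byte round trip described above can be sketched as follows. This is an illustration only, not code from the JDK or the patch; the class and method names are hypothetical, and StandardCharsets.UTF_8 stands in for getBytes("UTF8"). The second method shows, for contrast, what writeBytes does when it is handed an ordinary string directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the getBytes / String(byte[], int, int, int) /
// writeBytes combination discussed above.
final class WriteBytesRoundTripSketch {

    // Encodes to utf-8, wraps each byte in one char, then writes the low
    // byte of each char: the output equals the utf-8 bytes.
    @SuppressWarnings("deprecation")
    static byte[] viaOneCharPerByte(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        // Deprecated constructor: char = (hibyte << 8) | (byte & 0xff),
        // so with hibyte 0 every byte becomes one char below 0x100.
        String oneCharPerByte = new String(utf8, 0, 0, utf8.length);
        return writeBytes(oneCharPerByte);
    }

    // Passes the string to writeBytes unchanged: the high eight bits of
    // every char are discarded.
    static byte[] direct(String value) {
        return writeBytes(value);
    }

    private static byte[] writeBytes(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(sink)) {
            dos.writeBytes(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected in memory
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String value = "Gr\u00fc\u00dfe \u20ac";
        System.out.println(viaOneCharPerByte(value).length + " bytes via "
                + "one char per byte, " + direct(value).length
                + " bytes directly");
    }
}
```

The first method preserves every utf-8 byte, which matches the observation that the current solution works; the second shows why the method name can mislead when the argument is a normal string.
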
I was surprised that it was not trivial to list all valid utf characters. If someone has a better idea than isValidUtfCharacter in the attached test, let me know.
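
One possible way to enumerate such characters is sketched below. It is not the isValidUtfCharacter from the attached test, whose contents are not shown here; the class and method names are hypothetical. It assumes that "valid" means a code point survives an encode and decode round trip through utf-8, which excludes lone surrogates but still includes unassigned code points and noncharacters. A stricter definition could add a check such as Character.isDefined.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: treat a code point as usable if it survives a
// round trip through utf-8 encoding and decoding.
final class Utf8CodePointsSketch {

    static boolean roundTripsThroughUtf8(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return false; // outside 0..0x10FFFF
        }
        String s = new String(Character.toChars(codePoint));
        // A lone surrogate cannot be encoded; getBytes substitutes a
        // replacement byte, so the decoded string differs from s.
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        return s.equals(new String(encoded, StandardCharsets.UTF_8));
    }

    static int countEncodable() {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (roundTripsThroughUtf8(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countEncodable() + " code points round-trip");
    }
}
```
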
Altogether, I would not consider 6202130 resolved completely, unless maybe all remaining points are copied to 6443578 and maybe another bug about valid values for "Signature-Version" and "Manifest-Version", if at all desired. But I still consider the attached patch an improvement, and most of the remainder can then be solved in 6443578. I am looking forward to any kind of feedback.
Regards,
Philipp
