Re: 6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Brent Christian Fri, 08 May 2020 11:08:26 -0700

Hi, Philipp.  Sorry for the further delay.

Just to step back a bit:

The current manifest-writing behavior dates back to JDK 1.2 (at least),so compatibility is an overriding concern. It is unfortunate thatmulti-byte characters weren't better accounted for in the placement ofline breaks from the beginning.

But typically in cases like this, we would update the specification tofit the long-standing behavior, and avoid the risk of breakingapplications in the wild.

We can test that manifests written with your changes can be read by theJDK. But there is also manifest-reading code external to the JDK. Someexamples, that would need to be investigated:


jarmanifest
https://pypi.org/project/jarmanifest/

Chromium
https://chromium.googlesource.com/external/googleappengine/python/+/7e0ab775c587657f0f93a3134f2db99e46bb98bd/google/appengine/tools/jarfile.py

Maven
http://maven.apache.org/shared/maven-archiver/index.html

Apache Ant
https://en.wikipedia.org/wiki/JAR_(file_format)#Apache_Ant_Zip/JAR_support

IDEs would be another good place to check, and maybe also consumers ofJavaEE WAR / EAR files.

Performance is also a consideration. JMH benchmarks would be needed toconfirm that reading the new manifest is not slower, and to gauge anyregression in Manifest writing performance.



So that is the work that would need to be done to support this change.

The question then will be if the benefit of this change (seen onlyoutside of running code) outweighs the risk of changing behavior thathas not been an issue for 20+ years. It seems unlikely that a strongenough case could be made.


-Brent

On 4/13/20 10:29 AM, Philipp Kunz wrote:

Hi Naoto,
You are absolutely right to raise the question. I've also thought about
this but haven't come up so far with a compellingly elegant solution,
at least not yet.
If only String.isLatin1 was public that would come in very handy.
Character or anything else I looked at cannot tell if a string is
ascii. BreakIterator supports iterating backwards so we could start at
the potential line end but that requires a start position that is a
boundary to start with and that is not obviously possible due to the
regional indicators and probably other code points requiring stateful
processing. Same with regexes and I wouldn't know how to express groups
that could count bytes. It does not even seem to be possible to guess
any number of characters to start searching for a boundary because of
the statefulness. Even the most simple approach to detect latin1
Strings requires an encoding or a regex such as "[^\\p{ASCII}]" which
essentially is another inefficient loop. It also does not work to
encode the string into UTF-8 in a single pass because then it is not
known which grapheme boundary matches to which byte position. I also
tried to write the header names and the ": " delimiter without breaking
first but it did not seem to significantly affect performance.
UTF_8.newEncoder cannot process single surrogates, admittedly an edge
case, but required for compatibility. I added a fast path, see patch,
the best way I could think of. Did I miss a better way to tell ascii
strings from others?
What I found actually improving performance is based on the
consideration that strings are composed of primitive chars that will be
encoded into three bytes each maximum always that up to 24 characters
can be passed to writeChar72 at a time reducing the number of
writeChar72 and in turn String.getBytes invocations. The number of
characters that can be passed to writeChar72 is at most the number of
bytes remaining on the current manifest line (linePos) divided by three
but at least one. See attached patch.
Regards,Philipp

On Mon, 2020-03-30 at 14:31 -0700, [email protected] wrote:

Hi Philipp,
Sorry for the delay.
On 3/25/20 11:52 AM, Philipp Kunz wrote:

Hi Naoto,
See another patch attached with Locale.ROOT for the BreakIterator.
I will be glad to hear of any feedback.


I wonder how your change affects the performance, as it will do
String.getBytes(UTF-8) per each character. I think this can
definitely be improved by adding some fastpath, e.g., for ASCII. The
usage of the BreakIterator is fine, though.

There is another patch [1] around dealing with manifest attributes
during application launch. It certainly is related to 6202130 but
feels like a distinct set of changes with a slightly different
concern. Any opinion as how to proceed with that one?


I am not quite sure which patch you are referring to, but I agree
that creating an issue would not hurt.
Naoto

Regards,Philipp


[1]
https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.html


On Mon, 2020-03-23 at 09:05 -0700, [email protected] wrote:

Hi Philipp,
Right, Locale.ROOT is more appropriate here by definition, though
theimplementation is the same.
Naoto
On 3/21/20 5:19 AM, Philipp Kunz wrote:

Hi Naoto and everyone,
There are almost as many occurrences of Locale.ROOT as
Locale.US whichmade me wonder which one is more appropriately
locale-independent andwhich is probably another topic and not
actually relevant here
becauseBreakIterator.getCharacterInstance is locale-agnostic as
far as I can tell.
See attached patch with another attempt to fix bug 6202130.
Regards,Philipp

On Tue, 2020-03-10 at 10:45 -0700,[email protected]
  <mailto:[email protected]>  wrote:

Hi Philipp,
..., so using BreakIterator (withLocale.US) is more preferred
solution to me.
Naoto

Re: 6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Reply via email to