Hi, Philipp. Sorry for the further delay.
Just to step back a bit:
The current manifest-writing behavior dates back to JDK 1.2 (at least),
so compatibility is an overriding concern. It is unfortunate that
multi-byte characters weren't better accounted for in the placement of
line breaks from the beginning.
But typically in cases like this, we would update the specification to
fit the long-standing behavior, and avoid the risk of breaking
applications in the wild.
We can test that manifests written with your changes can be read by the
JDK. But there is also manifest-reading code external to the JDK. Some
examples, that would need to be investigated:
jarmanifest
https://pypi.org/project/jarmanifest/
Chromium
https://chromium.googlesource.com/external/googleappengine/python/+/7e0ab775c587657f0f93a3134f2db99e46bb98bd/google/appengine/tools/jarfile.py
Maven
http://maven.apache.org/shared/maven-archiver/index.html
Apache Ant
https://en.wikipedia.org/wiki/JAR_(file_format)#Apache_Ant_Zip/JAR_support
IDEs would be another good place to check, and maybe also consumers of
JavaEE WAR / EAR files.
Performance is also a consideration. JMH benchmarks would be needed to
confirm that reading the new manifest is not slower, and to gauge any
regression in Manifest writing performance.
So that is the work that would need to be done to support this change.
The question then will be if the benefit of this change (seen only
outside of running code) outweighs the risk of changing behavior that
has not been an issue for 20+ years. It seems unlikely that a strong
enough case could be made.
-Brent
On 4/13/20 10:29 AM, Philipp Kunz wrote:
Hi Naoto,
You are absolutely right to raise the question. I've also thought about
this but haven't come up so far with a compellingly elegant solution,
at least not yet.
If only String.isLatin1 was public that would come in very handy.
Character or anything else I looked at cannot tell if a string is
ascii. BreakIterator supports iterating backwards so we could start at
the potential line end but that requires a start position that is a
boundary to start with and that is not obviously possible due to the
regional indicators and probably other code points requiring stateful
processing. Same with regexes and I wouldn't know how to express groups
that could count bytes. It does not even seem to be possible to guess
any number of characters to start searching for a boundary because of
the statefulness. Even the most simple approach to detect latin1
Strings requires an encoding or a regex such as "[^\\p{ASCII}]" which
essentially is another inefficient loop. It also does not work to
encode the string into UTF-8 in a single pass because then it is not
known which grapheme boundary matches to which byte position. I also
tried to write the header names and the ": " delimiter without breaking
first but it did not seem to significantly affect performance.
UTF_8.newEncoder cannot process single surrogates, admittedly an edge
case, but required for compatibility. I added a fast path, see patch,
the best way I could think of. Did I miss a better way to tell ascii
strings from others?
What I found actually improving performance is based on the
consideration that strings are composed of primitive chars that will be
encoded into three bytes each maximum always that up to 24 characters
can be passed to writeChar72 at a time reducing the number of
writeChar72 and in turn String.getBytes invocations. The number of
characters that can be passed to writeChar72 is at most the number of
bytes remaining on the current manifest line (linePos) divided by three
but at least one. See attached patch.
Regards,Philipp
On Mon, 2020-03-30 at 14:31 -0700, naoto.s...@oracle.com wrote:
Hi Philipp,
Sorry for the delay.
On 3/25/20 11:52 AM, Philipp Kunz wrote:
Hi Naoto,
See another patch attached with Locale.ROOT for the BreakIterator.
I will be glad to hear of any feedback.
I wonder how your change affects the performance, as it will do
String.getBytes(UTF-8) per each character. I think this can
definitely be improved by adding some fastpath, e.g., for ASCII. The
usage of the BreakIterator is fine, though.
There is another patch [1] around dealing with manifest attributes
during application launch. It certainly is related to 6202130 but
feels like a distinct set of changes with a slightly different
concern. Any opinion as how to proceed with that one?
I am not quite sure which patch you are referring to, but I agree
that creating an issue would not hurt.
Naoto
Regards,Philipp
[1]
https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.html
On Mon, 2020-03-23 at 09:05 -0700, naoto.s...@oracle.com wrote:
Hi Philipp,
Right, Locale.ROOT is more appropriate here by definition, though
theimplementation is the same.
Naoto
On 3/21/20 5:19 AM, Philipp Kunz wrote:
Hi Naoto and everyone,
There are almost as many occurrences of Locale.ROOT as
Locale.US whichmade me wonder which one is more appropriately
locale-independent andwhich is probably another topic and not
actually relevant here
becauseBreakIterator.getCharacterInstance is locale-agnostic as
far as I can tell.
See attached patch with another attempt to fix bug 6202130.
Regards,Philipp
On Tue, 2020-03-10 at 10:45 -0700,naoto.s...@oracle.com
<mailto:naoto.s...@oracle.com> wrote:
Hi Philipp,
..., so using BreakIterator (withLocale.US) is more preferred
solution to me.
Naoto