On 2016-03-25 16:03, Paul Sandoz wrote:
On 25 Mar 2016, at 13:56, Claes Redestad <[email protected]> wrote:You've replaced the precomputed good suffix shift which i think makes the Math.max in math a bit harder to read. A comment or just move a local for "j < len - 1 ? len : 1" would make it a bit easier to read. It would be good to move the value of MULTIRELEASE_CHARS to its own line as the current line is requires scrolling when looking at the side-by-side diffs.Sure: http://cr.openjdk.java.net/~redestad/8152733/webrev.02/+1
Thanks!
We could marginally improve the bad shift too:
int badShift = (c < 128) ? lastOcc[c] : 0;
rather than masking (which might lop off the top bit of bytes of value encoded
characters), that’s clearer code-wise to me. In fact i bet we could reduce the
last occurrence array by half if we canonicalize to upper-case characters and
bound the range to [‘ ‘, ‘Z’].
byte c = b[i + j];
int goodShift = 0;
int badShift = 0;
if (c >= ‘ ‘ && c <= ‘z') {
if (c >= ‘a’) c -= 32; // Canonicalize
if (c != src[j]) {
// no match
goodShift = (j < len - 1) ? len : 1;
badShift = lastOcc[c - 32];
}
} else {
// no match, character not valid for name
goodShift = (j < len - 1) ? len : 1;
}
if (goodShift > 0) {
i += Math.max(j + 1 - badShift, goodShift);
continue next;
}
Just idle musings on my part, feel free to ignore ‘em.
Interesting, my only concern is we'd add more code/complexity, but it seems the shift will always be == len when not matching a character in src (the proof is left out as an exercise), so this can be simplified further to:
byte c = b[i + j];
if (c >= ' ' && c <= 'z') {
if (c >= 'a') c -= 32; // Canonicalize
if (c != src[j]) {
// no match
int goodShift = (j < len - 1) ? len : 1;
int badShift = lastOcc[c - 32];
i += Math.max(j + 1 - badShift, goodShift);
continue next;
}
} else {
// no match, character not valid for name
i += len;
continue next;
}
http://cr.openjdk.java.net/~redestad/8152733/webrev.03/
(included Steve's suggestion)
Since this avoids reading from lastOcc for anything but the chars
actually matching, this actually seems to improve things a bit locally,
so I'm willing to accept the pile-on. I'll run some numbers in the lab...
/Claes
Paul.Difference from previous for convenience: diff -r fc748fa355e0 src/java.base/share/classes/java/util/jar/JarFile.java --- a/src/java.base/share/classes/java/util/jar/JarFile.java Fri Mar 25 13:38:37 2016 +0100 +++ b/src/java.base/share/classes/java/util/jar/JarFile.java Fri Mar 25 13:47:31 2016 +0100 @@ -887,11 +887,15 @@ } // Statics for hand-coded Boyer-Moore search - private static final byte[] CLASSPATH_CHARS = {'c','l','a','s','s','-','p','a','t','h', ':', ' '}; + private static final byte[] CLASSPATH_CHARS = + {'c','l','a','s','s','-','p','a','t','h', ':', ' '}; + // The bad character shift for "class-path:" private static final byte[] CLASSPATH_LASTOCC; - private static final byte[] MULTIRELEASE_CHARS = {'m','u','l','t','i','-','r','e','l','e', 'a', 's', 'e', ':', ' '}; + private static final byte[] MULTIRELEASE_CHARS = + {'m','u','l','t','i','-','r','e','l','e', 'a', 's', 'e', ':', ' '}; + // The bad character shift for "multi-release: " private static final byte[] MULTIRELEASE_LASTOCC; @@ -959,6 +963,8 @@ /** * Returns true if the pattern {@code src} is found in {@code b}. * The {@code lastOcc} array is the precomputed bad character shifts. + * Since there are no repeated substrings in our search strings, + * the good character shifts can be replaced with a comparison. */ private boolean match(byte[] src, byte[] b, byte[] lastOcc) { int len = src.length; @@ -972,7 +978,8 @@ c += 32; } if (c != src[j]) { - i += Math.max(j + 1 - lastOcc[c&0x7F], j < len - 1 ? len : 1); + int goodShift = (j < len - 1) ? len : 1; + i += Math.max(j + 1 - lastOcc[c&0x7F], goodShift); continue next; } } Thanks! /Claes
