Hi, On 2020-04-21 19:19, Martin Buchholz wrote:
Thanks for doing this. (even though I would not have)
at least I learned a few things doing it. :-)
I'd be more inclined to have separate code for UTF-8 encoded zip files, that everyone should be using now, rather than have special code for ASCII-compatible. Perhaps a single ZipCoder subclass Utf8ZipCoder.
Yes, I think this is reasonable path forward. Eirik has been sending me suggestions to improve on this in similar ways, too. Since we have a few additional optimizations queued up, where it might be useful or even necessary to split the UTF-8 implementation out into a fast path to keep it both correct and sane, I'd like to hold back on well-meaning improvements from this bug fix and focus on making sure the behavior is correct.
I am the author of that test infrastructure boiler plate you see at the bottom of tests. It's not useful to modify that. More useful would be to migrate tests to junit or testng, but that's a lot of work.
IDE saw unused code. :-) I can revert those changes as they're not important.
Are we really willing to maintain tests for UTF-8, ASCII-compatible, and ASCII-incompatible code paths?
If we aren't then I think we should terminally deprecate all the ZipFile constructors that take a Charset argument. Not a call I can make.
ASCII-compatibility is also deeply embedded in the original C code, which is still there
Yep, the assumption go way back, but as long as it's functionality only relevant to jar files it's fine in both impls. Is there any usage left of the C implementation these days? /Claes
/* * Returns true if the specified entry's name begins with the string * "META-INF/" irrespective of case. */ static int isMetaName(const char *name, int length) { const char *s; if (length < (int)sizeof("META-INF/") - 1) return 0; for (s = "META-INF/"; *s != '\0'; s++) { char c = *name++; // Avoid toupper; it's locale-dependent if (c >= 'a' && c <= 'z') c += 'A' - 'a'; if (*s != c) return 0; } return 1; }On Tue, Apr 21, 2020 at 6:51 AM Claes Redestad <[email protected] <mailto:[email protected]>> wrote:Hi, ZipFile.getEntry does optimizations to check for directory entries by adding a '/' to the encoded byte array. JDK-8242959 improved on this optimization, but a question was raised Jason Zaugg on whether the optimization is always valid[1]. Turns out it isn't, but only for non-ASCII compatible charsets. While JarFiles are always assumed to be UTF-8-encoded or compatible, ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as either 2F 00 or 00 2F, which means the hash code would differ and a directory "foo/" potentially not be found when looking up "foo". Further complications arise when/if the directory name ends with a code point that might be encoded so that the final byte is 2F, e.g. \u012F. This patch only enables this low-level optimization when the charset encoding used is known to be ASCII compatible in the sense that 2F will be encoded as single-byte 2F. UTF-8 is compatible in this sense - and since this is the charset exclusively used by JarFile these changes have little to no effect on startup performance in the cases we've been looking at. Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8243254 I've partially repurposed ZipFile/TestZipFile to test some such corner cases. Mainly expanded it to test ZipFiles created with various non- standard encodings, but also slimmed it down so that it can actually run quickly and reliably - as well as enabled it in regular testing. The updated test fails both before and after JDK-8242959, but passes with the proposed changes. Testing: tier1+2 Thanks! /Claes [1] https://twitter.com/retronym/status/1252134723431747584?s=20
