Hi,

On 2020-04-21 19:19, Martin Buchholz wrote:
Thanks for doing this.  (even though I would not have)

at least I learned a few things doing it. :-)


I'd be more inclined to have separate code for UTF-8 encoded zip files, that everyone should be using now, rather than have special code for ASCII-compatible.  Perhaps a single ZipCoder subclass Utf8ZipCoder.
Yes, I think this is reasonable path forward. Eirik has been sending me
suggestions to improve on this in similar ways, too.

Since we have a few additional optimizations queued up, where it might
be useful or even necessary to split the UTF-8 implementation out into a
fast path to keep it both correct and sane, I'd like to hold back on
well-meaning improvements from this bug fix and focus on making sure the
behavior is correct.


I am the author of that test infrastructure boiler plate you see at the bottom of tests.  It's not useful to modify that.  More useful would be to migrate tests to junit or testng, but that's a lot of work.

IDE saw unused code. :-) I can revert those changes as they're not
important.


Are we really willing to maintain tests for UTF-8, ASCII-compatible, and ASCII-incompatible code paths?

If we aren't then I think we should terminally deprecate all the ZipFile
constructors that take a Charset argument. Not a call I can make.


ASCII-compatibility is also deeply embedded in the original C code, which is still there

Yep, the assumption go way back, but as long as it's functionality only
relevant to jar files it's fine in both impls.

Is there any usage left of the C implementation these days?

/Claes


/*
  * Returns true if the specified entry's name begins with the string
  * "META-INF/" irrespective of case.
  */
static int
isMetaName(const char *name, int length)
{
     const char *s;
     if (length < (int)sizeof("META-INF/") - 1)
         return 0;
     for (s = "META-INF/"; *s != '\0'; s++) {
         char c = *name++;
         // Avoid toupper; it's locale-dependent
         if (c >= 'a' && c <= 'z') c += 'A' - 'a';
         if (*s != c)
             return 0;
     }
     return 1;
}

On Tue, Apr 21, 2020 at 6:51 AM Claes Redestad <[email protected] <mailto:[email protected]>> wrote:

    Hi,

    ZipFile.getEntry does optimizations to check for directory entries by
    adding a '/' to the encoded byte array. JDK-8242959 improved on this
    optimization, but a question was raised Jason Zaugg on whether the
    optimization is always valid[1]. Turns out it isn't, but only for
    non-ASCII compatible charsets.

    While JarFiles are always assumed to be UTF-8-encoded or compatible,
    ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as
    either 2F 00 or 00 2F, which means the hash code would differ and a
    directory "foo/" potentially not be found when looking up "foo". Further
    complications arise when/if the directory name ends with a code point
    that might be encoded so that the final byte is 2F, e.g. \u012F.

    This patch only enables this low-level optimization when the charset
    encoding used is known to be ASCII compatible in the sense that 2F will
    be encoded as single-byte 2F. UTF-8 is compatible in this sense - and
    since this is the charset exclusively used by JarFile these changes have
    little to no effect on startup performance in the cases we've been
    looking at.

    Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/
    Bug: https://bugs.openjdk.java.net/browse/JDK-8243254

    I've partially repurposed ZipFile/TestZipFile to test some such corner
    cases. Mainly expanded it to test ZipFiles created with various non-
    standard encodings, but also slimmed it down so that it can actually run
    quickly and reliably - as well as enabled it in regular testing. The
    updated test fails both before and after JDK-8242959, but passes with
    the proposed changes.

    Testing: tier1+2

    Thanks!

    /Claes

    [1] https://twitter.com/retronym/status/1252134723431747584?s=20

Reply via email to