Re: RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-27 Thread Claes Redestad

On 2020-04-27 17:16, Lance Andersen wrote:

Hi Claes,

The changes and the performance bump all look good and with the minor 
change below helps the readability.


Thanks! Pushed.

/Claes



Thank you for using your performance expertise to improve this area.

Best
Lance

On Apr 27, 2020, at 6:11 AM, Claes Redestad > wrote:




On 2020-04-27 11:49, Volker Simonis wrote:

On Sun, Apr 26, 2020 at 11:34 PM Claes Redestad
mailto:claes.redes...@oracle.com>> wrote:


Hi again,

On 2020-04-24 21:22, Claes Redestad wrote:

It seems that 'getEntryHitUncached' is getting slightly slower with
your change while all the other variants get significantly faster. I
don't think that's a problem, but do you have an explanation why
that's the case?


I've noticed it swing a bit either way, and have been asking myself the
same thing. After a little analysis I think it's actually a bug in my
microbenchmark: I'm always looking up the same entry, and thus hitting
the same bucket in the hash table. If that one has a collision, 
we'll do

a few extra passes. If not, we won't. This might be reflected as a
significant swing in either direction.

I'm going to try rewriting it to consider more (if not all) entries in
the zip file. That should mean the cost averages out a bit.


after I improved my micro to root out sources of variance, the
performance issue for hits persisted.

Luckily Eirik had a brilliant idea: Why not decode the bytes in the
cen to a String and compare that, rather than the other way around?
To some surprise it turns out this gives us about a ~1.2x speedup for
getEntryHit and getEntryHitUncached over open.00 - and comfortably
just ahead of the baseline on getEntryHitUncached[1]. It also leads to
slightly cleaner code[2].

Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.01/

The speed-up appears to come from String.equals, which is intrinsified
and significantly faster than the replaced loop. I profiled allocation
per operation and it stays the same (EA removes the String).


Great! Another nice improvement. The changes look good to me.


Thanks!


Following just two minor remarks:
In ZipCoder.normalizedHashDecode() you've changed the line:
if (limit > 0 && decoded[limit - 1] != '/') {
to:
if (limit > pos && decoded[limit - 1] != '/') {
which was first a little confusing to me. But in the end it turns out
that this is semantically the same, because the
CharsetDecoder.decode() method called before is guaranteed to return a
"newly-allocated character buffer" and its "position will be zero and
its limit will follow the last character written". This also explains
why you don't have to take the CharBuffer's "arrayOffset()" into
account if you use the CharBuffer's backing array (because it will
always be 0 for newly created buffers). So maybe you can put in some
comments to make it less confusing for the ingenuous reader:
CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, end - off));
// 'cb' is a newly allocated CharBuffer with 'pos == 0'
int pos = cb.position();
int limit = cb.limit();
char[] decoded;
if (cb.hasArray()) {
// 'cb.arrayOffset()' is zero for newly allocated CharBuffers
decoded = cb.array();
} else {
decoded = new char[limit - pos];
cb.get(decoded);
}
I think you can also remove the "else" branch (and maybe replace it
with an assertion) because newly allocated CharBuffers are guaranteed
to be backed by an array with array offset zero (see
https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/nio/CharBuffer.html#allocate(int)
).


Yes, it does seem the specification is pretty strong here and we can
assume that pos == 0, arrayOffset == 0 and cb.hasArray() == true.

I'll simplify to:

   // 'cb' is a newly allocated CharBuffer with 'pos == 0',
   // 'arrayOffset == 0', backed by an array.
   CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, 
end - off));

   int limit = cb.limit();
   char[] decoded = cb.array();
   for (int i = 0; i < limit; i++) {
   h = 31 * h + decoded[i];
   }
   if (limit > 0 && decoded[limit - 1] != '/') {
   h = 31 * h + '/';
   }

An assert seems like overkill.


Zipcoder.get() seems to be the only remaining if block without braces.
Maybe you'll wnat to fix that once your on it?
public static ZipCoder get(Charset charset) {
if (charset == UTF_8.INSTANCE)
return UTF8;
return new ZipCoder(charset);


Sure.

/Claes


Thumbs up from my side. There's no need for a new webrev from my side.
Best regards,
Volker

Testing: tier1-4

Thanks!

/Claes

[1]
Baseline:
Benchmark   (size) Mode Cnt   Score   Error 
Units

ZipFileGetEntry.getEntryHit  512  avgt   15  126.264 ± 5.297
  ns/op
ZipFileGetEntry.getEntryHit 1024  avgt   15  130.823 ± 7.212
  ns/op
ZipFileGetEntry.getEntryHitUncached  512  avgt   15  152.149 ± 4.978
  ns/op

Re: RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-27 Thread Lance Andersen
Hi Claes,

The changes and the performance bump all look good and with the minor change 
below helps the readability.

Thank you for using your performance expertise to improve this area.

Best
Lance

> On Apr 27, 2020, at 6:11 AM, Claes Redestad  wrote:
> 
> 
> 
> On 2020-04-27 11:49, Volker Simonis wrote:
>> On Sun, Apr 26, 2020 at 11:34 PM Claes Redestad
>>  wrote:
>>> 
>>> Hi again,
>>> 
>>> On 2020-04-24 21:22, Claes Redestad wrote:
> It seems that 'getEntryHitUncached' is getting slightly slower with
> your change while all the other variants get significantly faster. I
> don't think that's a problem, but do you have an explanation why
> that's the case?
 
 I've noticed it swing a bit either way, and have been asking myself the
 same thing. After a little analysis I think it's actually a bug in my
 microbenchmark: I'm always looking up the same entry, and thus hitting
 the same bucket in the hash table. If that one has a collision, we'll do
 a few extra passes. If not, we won't. This might be reflected as a
 significant swing in either direction.
 
 I'm going to try rewriting it to consider more (if not all) entries in
 the zip file. That should mean the cost averages out a bit.
>>> 
>>> after I improved my micro to root out sources of variance, the
>>> performance issue for hits persisted.
>>> 
>>> Luckily Eirik had a brilliant idea: Why not decode the bytes in the
>>> cen to a String and compare that, rather than the other way around?
>>> To some surprise it turns out this gives us about a ~1.2x speedup for
>>> getEntryHit and getEntryHitUncached over open.00 - and comfortably
>>> just ahead of the baseline on getEntryHitUncached[1]. It also leads to
>>> slightly cleaner code[2].
>>> 
>>> Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.01/
>>> 
>>> The speed-up appears to come from String.equals, which is intrinsified
>>> and significantly faster than the replaced loop. I profiled allocation
>>> per operation and it stays the same (EA removes the String).
>>> 
>> Great! Another nice improvement. The changes look good to me.
> 
> Thanks!
> 
>> Following just two minor remarks:
>> In ZipCoder.normalizedHashDecode() you've changed the line:
>> if (limit > 0 && decoded[limit - 1] != '/') {
>> to:
>> if (limit > pos && decoded[limit - 1] != '/') {
>> which was first a little confusing to me. But in the end it turns out
>> that this is semantically the same, because the
>> CharsetDecoder.decode() method called before is guaranteed to return a
>> "newly-allocated character buffer" and its "position will be zero and
>> its limit will follow the last character written". This also explains
>> why you don't have to take the CharBuffer's "arrayOffset()" into
>> account if you use the CharBuffer's backing array (because it will
>> always be 0 for newly created buffers). So maybe you can put in some
>> comments to make it less confusing for the ingenuous reader:
>> CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, end - off));
>> // 'cb' is a newly allocated CharBuffer with 'pos == 0'
>> int pos = cb.position();
>> int limit = cb.limit();
>> char[] decoded;
>> if (cb.hasArray()) {
>> // 'cb.arrayOffset()' is zero for newly allocated CharBuffers
>> decoded = cb.array();
>> } else {
>> decoded = new char[limit - pos];
>> cb.get(decoded);
>> }
>> I think you can also remove the "else" branch (and maybe replace it
>> with an assertion) because newly allocated CharBuffers are guaranteed
>> to be backed by an array with array offset zero (see
>> https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/nio/CharBuffer.html#allocate(int)
>> ).
> 
> Yes, it does seem the specification is pretty strong here and we can
> assume that pos == 0, arrayOffset == 0 and cb.hasArray() == true.
> 
> I'll simplify to:
> 
>// 'cb' is a newly allocated CharBuffer with 'pos == 0',
>// 'arrayOffset == 0', backed by an array.
>CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, end - 
> off));
>int limit = cb.limit();
>char[] decoded = cb.array();
>for (int i = 0; i < limit; i++) {
>h = 31 * h + decoded[i];
>}
>if (limit > 0 && decoded[limit - 1] != '/') {
>h = 31 * h + '/';
>}
> 
> An assert seems like overkill.
> 
>> Zipcoder.get() seems to be the only remaining if block without braces.
>> Maybe you'll wnat to fix that once your on it?
>> public static ZipCoder get(Charset charset) {
>> if (charset == UTF_8.INSTANCE)
>> return UTF8;
>> return new ZipCoder(charset);
> 
> Sure.
> 
> /Claes
> 
>> Thumbs up from my side. There's no need for a new webrev from my side.
>> Best regards,
>> Volker
>>> Testing: tier1-4
>>> 
>>> Thanks!
>>> 
>>> /Claes
>>> 
>>> [1]
>>> Baseline:
>>> Benchmark   (size) Mode Cnt   Score   Error Units
>>> 

Re: RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-27 Thread Claes Redestad




On 2020-04-27 11:49, Volker Simonis wrote:

On Sun, Apr 26, 2020 at 11:34 PM Claes Redestad
 wrote:


Hi again,

On 2020-04-24 21:22, Claes Redestad wrote:

It seems that 'getEntryHitUncached' is getting slightly slower with
your change while all the other variants get significantly faster. I
don't think that's a problem, but do you have an explanation why
that's the case?


I've noticed it swing a bit either way, and have been asking myself the
same thing. After a little analysis I think it's actually a bug in my
microbenchmark: I'm always looking up the same entry, and thus hitting
the same bucket in the hash table. If that one has a collision, we'll do
a few extra passes. If not, we won't. This might be reflected as a
significant swing in either direction.

I'm going to try rewriting it to consider more (if not all) entries in
the zip file. That should mean the cost averages out a bit.


after I improved my micro to root out sources of variance, the
performance issue for hits persisted.

Luckily Eirik had a brilliant idea: Why not decode the bytes in the
cen to a String and compare that, rather than the other way around?
To some surprise it turns out this gives us about a ~1.2x speedup for
getEntryHit and getEntryHitUncached over open.00 - and comfortably
just ahead of the baseline on getEntryHitUncached[1]. It also leads to
slightly cleaner code[2].

Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.01/

The speed-up appears to come from String.equals, which is intrinsified
and significantly faster than the replaced loop. I profiled allocation
per operation and it stays the same (EA removes the String).



Great! Another nice improvement. The changes look good to me.


Thanks!


Following just two minor remarks:

In ZipCoder.normalizedHashDecode() you've changed the line:

if (limit > 0 && decoded[limit - 1] != '/') {

to:

if (limit > pos && decoded[limit - 1] != '/') {

which was first a little confusing to me. But in the end it turns out
that this is semantically the same, because the
CharsetDecoder.decode() method called before is guaranteed to return a
"newly-allocated character buffer" and its "position will be zero and
its limit will follow the last character written". This also explains
why you don't have to take the CharBuffer's "arrayOffset()" into
account if you use the CharBuffer's backing array (because it will
always be 0 for newly created buffers). So maybe you can put in some
comments to make it less confusing for the ingenuous reader:

CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, end - off));
// 'cb' is a newly allocated CharBuffer with 'pos == 0'
int pos = cb.position();
int limit = cb.limit();
char[] decoded;
if (cb.hasArray()) {
 // 'cb.arrayOffset()' is zero for newly allocated CharBuffers
 decoded = cb.array();
} else {
 decoded = new char[limit - pos];
 cb.get(decoded);
}

I think you can also remove the "else" branch (and maybe replace it
with an assertion) because newly allocated CharBuffers are guaranteed
to be backed by an array with array offset zero (see
https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/nio/CharBuffer.html#allocate(int)
).


Yes, it does seem the specification is pretty strong here and we can
assume that pos == 0, arrayOffset == 0 and cb.hasArray() == true.

I'll simplify to:

// 'cb' is a newly allocated CharBuffer with 'pos == 0',
// 'arrayOffset == 0', backed by an array.
CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, 
end - off));

int limit = cb.limit();
char[] decoded = cb.array();
for (int i = 0; i < limit; i++) {
h = 31 * h + decoded[i];
}
if (limit > 0 && decoded[limit - 1] != '/') {
h = 31 * h + '/';
}

An assert seems like overkill.



Zipcoder.get() seems to be the only remaining if block without braces.
Maybe you'll wnat to fix that once your on it?

 public static ZipCoder get(Charset charset) {
 if (charset == UTF_8.INSTANCE)
 return UTF8;
 return new ZipCoder(charset);


Sure.

/Claes



Thumbs up from my side. There's no need for a new webrev from my side.

Best regards,
Volker


Testing: tier1-4

Thanks!

/Claes

[1]
Baseline:
Benchmark   (size) Mode Cnt   Score   Error Units
ZipFileGetEntry.getEntryHit  512  avgt   15  126.264 ± 5.297
   ns/op
ZipFileGetEntry.getEntryHit 1024  avgt   15  130.823 ± 7.212
   ns/op
ZipFileGetEntry.getEntryHitUncached  512  avgt   15  152.149 ± 4.978
   ns/op
ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  151.527 ± 4.054
   ns/op

open.01:
Benchmark (size)  Mode  CntScore   Error
   Units
ZipFileGetEntry.getEntryHit  512  avgt   15   84.450 ± 5.474
   ns/op
ZipFileGetEntry.getEntryHit 1024  avgt   15   85.224 ± 3.776
   ns/op

Re: RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-27 Thread Volker Simonis
On Sun, Apr 26, 2020 at 11:34 PM Claes Redestad
 wrote:
>
> Hi again,
>
> On 2020-04-24 21:22, Claes Redestad wrote:
> >> It seems that 'getEntryHitUncached' is getting slightly slower with
> >> your change while all the other variants get significantly faster. I
> >> don't think that's a problem, but do you have an explanation why
> >> that's the case?
> >
> > I've noticed it swing a bit either way, and have been asking myself the
> > same thing. After a little analysis I think it's actually a bug in my
> > microbenchmark: I'm always looking up the same entry, and thus hitting
> > the same bucket in the hash table. If that one has a collision, we'll do
> > a few extra passes. If not, we won't. This might be reflected as a
> > significant swing in either direction.
> >
> > I'm going to try rewriting it to consider more (if not all) entries in
> > the zip file. That should mean the cost averages out a bit.
>
> after I improved my micro to root out sources of variance, the
> performance issue for hits persisted.
>
> Luckily Eirik had a brilliant idea: Why not decode the bytes in the
> cen to a String and compare that, rather than the other way around?
> To some surprise it turns out this gives us about a ~1.2x speedup for
> getEntryHit and getEntryHitUncached over open.00 - and comfortably
> just ahead of the baseline on getEntryHitUncached[1]. It also leads to
> slightly cleaner code[2].
>
> Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.01/
>
> The speed-up appears to come from String.equals, which is intrinsified
> and significantly faster than the replaced loop. I profiled allocation
> per operation and it stays the same (EA removes the String).
>

Great! Another nice improvement. The changes look good to me.
Following just two minor remarks:

In ZipCoder.normalizedHashDecode() you've changed the line:

if (limit > 0 && decoded[limit - 1] != '/') {

to:

if (limit > pos && decoded[limit - 1] != '/') {

which was first a little confusing to me. But in the end it turns out
that this is semantically the same, because the
CharsetDecoder.decode() method called before is guaranteed to return a
"newly-allocated character buffer" and its "position will be zero and
its limit will follow the last character written". This also explains
why you don't have to take the CharBuffer's "arrayOffset()" into
account if you use the CharBuffer's backing array (because it will
always be 0 for newly created buffers). So maybe you can put in some
comments to make it less confusing for the ingenuous reader:

CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, end - off));
// 'cb' is a newly allocated CharBuffer with 'pos == 0'
int pos = cb.position();
int limit = cb.limit();
char[] decoded;
if (cb.hasArray()) {
// 'cb.arrayOffset()' is zero for newly allocated CharBuffers
decoded = cb.array();
} else {
decoded = new char[limit - pos];
cb.get(decoded);
}

I think you can also remove the "else" branch (and maybe replace it
with an assertion) because newly allocated CharBuffers are guaranteed
to be backed by an array with array offset zero (see
https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/nio/CharBuffer.html#allocate(int)
).

Zipcoder.get() seems to be the only remaining if block without braces.
Maybe you'll wnat to fix that once your on it?

public static ZipCoder get(Charset charset) {
if (charset == UTF_8.INSTANCE)
return UTF8;
return new ZipCoder(charset);

Thumbs up from my side. There's no need for a new webrev from my side.

Best regards,
Volker

> Testing: tier1-4
>
> Thanks!
>
> /Claes
>
> [1]
> Baseline:
> Benchmark   (size) Mode Cnt   Score   Error Units
> ZipFileGetEntry.getEntryHit  512  avgt   15  126.264 ± 5.297
>   ns/op
> ZipFileGetEntry.getEntryHit 1024  avgt   15  130.823 ± 7.212
>   ns/op
> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  152.149 ± 4.978
>   ns/op
> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  151.527 ± 4.054
>   ns/op
>
> open.01:
> Benchmark (size)  Mode  CntScore   Error
>   Units
> ZipFileGetEntry.getEntryHit  512  avgt   15   84.450 ± 5.474
>   ns/op
> ZipFileGetEntry.getEntryHit 1024  avgt   15   85.224 ± 3.776
>   ns/op
> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  140.448 ± 4.667
>   ns/op
> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  145.046 ± 7.363
>
> [2] I stopped short of taking the cleanup a step further by decoding to
> String even in initCEN, which sadly isn't performance neutral:
>
> http://cr.openjdk.java.net/~redestad/8243469/open.01.init_decode/
>
> Something for the future to consider, maybe.


Re: RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-26 Thread Claes Redestad

Hi again,

On 2020-04-24 21:22, Claes Redestad wrote:

It seems that 'getEntryHitUncached' is getting slightly slower with
your change while all the other variants get significantly faster. I
don't think that's a problem, but do you have an explanation why
that's the case?


I've noticed it swing a bit either way, and have been asking myself the
same thing. After a little analysis I think it's actually a bug in my
microbenchmark: I'm always looking up the same entry, and thus hitting
the same bucket in the hash table. If that one has a collision, we'll do
a few extra passes. If not, we won't. This might be reflected as a
significant swing in either direction.

I'm going to try rewriting it to consider more (if not all) entries in
the zip file. That should mean the cost averages out a bit.


after I improved my micro to root out sources of variance, the
performance issue for hits persisted.

Luckily Eirik had a brilliant idea: Why not decode the bytes in the
cen to a String and compare that, rather than the other way around?
To some surprise it turns out this gives us about a ~1.2x speedup for
getEntryHit and getEntryHitUncached over open.00 - and comfortably
just ahead of the baseline on getEntryHitUncached[1]. It also leads to
slightly cleaner code[2].

Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.01/

The speed-up appears to come from String.equals, which is intrinsified
and significantly faster than the replaced loop. I profiled allocation
per operation and it stays the same (EA removes the String).

Testing: tier1-4

Thanks!

/Claes

[1]
Baseline:
Benchmark   (size) Mode Cnt   Score   Error Units
ZipFileGetEntry.getEntryHit  512  avgt   15  126.264 ± 5.297 
 ns/op
ZipFileGetEntry.getEntryHit 1024  avgt   15  130.823 ± 7.212 
 ns/op
ZipFileGetEntry.getEntryHitUncached  512  avgt   15  152.149 ± 4.978 
 ns/op
ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  151.527 ± 4.054 
 ns/op


open.01:
Benchmark (size)  Mode  CntScore   Error 
 Units
ZipFileGetEntry.getEntryHit  512  avgt   15   84.450 ± 5.474 
 ns/op
ZipFileGetEntry.getEntryHit 1024  avgt   15   85.224 ± 3.776 
 ns/op
ZipFileGetEntry.getEntryHitUncached  512  avgt   15  140.448 ± 4.667 
 ns/op

ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  145.046 ± 7.363

[2] I stopped short of taking the cleanup a step further by decoding to
String even in initCEN, which sadly isn't performance neutral:

http://cr.openjdk.java.net/~redestad/8243469/open.01.init_decode/

Something for the future to consider, maybe.


Re: RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-24 Thread Claes Redestad




On 2020-04-24 20:07, Lance Andersen wrote:

Hi Claes,

Thank you for continuing to improve the performance in this part of 
java.util.zip.  The changes I think look pretty good.


Best
Lance



Thanks for the review, Lance!

Hopefully a few more improvements to come..

/Claes


Re: RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-24 Thread Claes Redestad

Hi Volker,

On 2020-04-24 20:37, Volker Simonis wrote:

On Thu, Apr 23, 2020 at 2:35 PM Claes Redestad
 wrote:


Hi,

current implementation of ZipFile.getEntryPos takes the encoded byte[]
of the String we're looking up. Which means when looking up entries
across multiple jar files, we allocate the byte[] over and over for each
jar file searched.

If we instead refactor the internal hash table to use a normalized
String-based hash value, we can almost always avoid both the repeated
encoding and (most of) the hash calculation when the entry is not found
in the jar/zip.

This realizes a significant startup improvement on applications with
several or many jar files. A rather typical case. For example I see a
25% speedup of ZipEnty.getEntry calls on a Spring PetClinic, which
reduces total startup time by ~120ms, or ~2.5% of total, on my setup. At
the same time remaining neutral on single-jar apps.

Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.00/
Bug:https://bugs.openjdk.java.net/browse/JDK-8243469

Testing: tier1-2


Hi Claes,

that's a little puzzling change but also a nice enhancement and cleanup.


first: thanks for reviewing!

Yes, I've tried to simplify and explain everything clearly, since it's a
rather delicate area and we're already stretching the complexity budget
thin.



I think it looks good. I have only two minor comments:

There's no check for "end > 0" here:
   93 @Override
   94 boolean hasTrailingSlash(byte[] a, int end) {
   95 return a[end - 1] == '/';
   96 }
I think that's currently no real problem, but maybe put in a check for any case?


Will do.



And while you're on it, I think the following comment should be updated from:

  641 /* Checks ensureOpen() before invoke this method */

to something like:

  641 /* Check ensureOpen() before invoking this method */


Will do.




I've also had a quick look at the microbenchmark which you've
apparently only added today :)

It seems that 'getEntryHitUncached' is getting slightly slower with
your change while all the other variants get significantly faster. I
don't think that's a problem, but do you have an explanation why
that's the case?


I've noticed it swing a bit either way, and have been asking myself the
same thing. After a little analysis I think it's actually a bug in my
microbenchmark: I'm always looking up the same entry, and thus hitting
the same bucket in the hash table. If that one has a collision, we'll do
a few extra passes. If not, we won't. This might be reflected as a
significant swing in either direction.

I'm going to try rewriting it to consider more (if not all) entries in
the zip file. That should mean the cost averages out a bit.

/Claes



Thanks for this nice improvement,
Volker



Thanks!

/Claes


Re: RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-24 Thread Volker Simonis
On Thu, Apr 23, 2020 at 2:35 PM Claes Redestad
 wrote:
>
> Hi,
>
> current implementation of ZipFile.getEntryPos takes the encoded byte[]
> of the String we're looking up. Which means when looking up entries
> across multiple jar files, we allocate the byte[] over and over for each
> jar file searched.
>
> If we instead refactor the internal hash table to use a normalized
> String-based hash value, we can almost always avoid both the repeated
> encoding and (most of) the hash calculation when the entry is not found
> in the jar/zip.
>
> This realizes a significant startup improvement on applications with
> several or many jar files. A rather typical case. For example I see a
> 25% speedup of ZipEnty.getEntry calls on a Spring PetClinic, which
> reduces total startup time by ~120ms, or ~2.5% of total, on my setup. At
> the same time remaining neutral on single-jar apps.
>
> Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.00/
> Bug:https://bugs.openjdk.java.net/browse/JDK-8243469
>
> Testing: tier1-2

Hi Claes,

that's a little puzzling change but also a nice enhancement and cleanup.

I think it looks good. I have only two minor comments:

There's no check for "end > 0" here:
  93 @Override
  94 boolean hasTrailingSlash(byte[] a, int end) {
  95 return a[end - 1] == '/';
  96 }
I think that's currently no real problem, but maybe put in a check for any case?

And while you're on it, I think the following comment should be updated from:

 641 /* Checks ensureOpen() before invoke this method */

to something like:

 641 /* Check ensureOpen() before invoking this method */


I've also had a quick look at the microbenchmark which you've
apparently only added today :)

It seems that 'getEntryHitUncached' is getting slightly slower with
your change while all the other variants get significantly faster. I
don't think that's a problem, but do you have an explanation why
that's the case?

Thanks for this nice improvement,
Volker

>
> Thanks!
>
> /Claes


Re: RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-24 Thread Lance Andersen
Hi Claes,

Thank you for continuing to improve the performance in this part of 
java.util.zip.  The changes I think look pretty good. 

Best
Lance

> On Apr 23, 2020, at 8:34 AM, Claes Redestad  wrote:
> 
> Hi,
> 
> current implementation of ZipFile.getEntryPos takes the encoded byte[]
> of the String we're looking up. Which means when looking up entries
> across multiple jar files, we allocate the byte[] over and over for each
> jar file searched.
> 
> If we instead refactor the internal hash table to use a normalized
> String-based hash value, we can almost always avoid both the repeated
> encoding and (most of) the hash calculation when the entry is not found
> in the jar/zip.
> 
> This realizes a significant startup improvement on applications with
> several or many jar files. A rather typical case. For example I see a
> 25% speedup of ZipEnty.getEntry calls on a Spring PetClinic, which
> reduces total startup time by ~120ms, or ~2.5% of total, on my setup. At
> the same time remaining neutral on single-jar apps.
> 
> Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.00/
> Bug:https://bugs.openjdk.java.net/browse/JDK-8243469
> 
> Testing: tier1-2
> 
> Thanks!
> 
> /Claes

 
  

 Lance Andersen| 
Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering 
1 Network Drive 
Burlington, MA 01803
lance.ander...@oracle.com 





RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

2020-04-23 Thread Claes Redestad

Hi,

current implementation of ZipFile.getEntryPos takes the encoded byte[]
of the String we're looking up. Which means when looking up entries
across multiple jar files, we allocate the byte[] over and over for each
jar file searched.

If we instead refactor the internal hash table to use a normalized
String-based hash value, we can almost always avoid both the repeated
encoding and (most of) the hash calculation when the entry is not found
in the jar/zip.

This realizes a significant startup improvement on applications with
several or many jar files. A rather typical case. For example I see a
25% speedup of ZipEnty.getEntry calls on a Spring PetClinic, which
reduces total startup time by ~120ms, or ~2.5% of total, on my setup. At
the same time remaining neutral on single-jar apps.

Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.00/
Bug:https://bugs.openjdk.java.net/browse/JDK-8243469

Testing: tier1-2

Thanks!

/Claes