Re: Improving ZipFile.getEntry performance using Bloom filters

Ioi Lam Tue, 14 Apr 2020 15:36:27 -0700



On 4/13/20 12:26 PM, Eirik Bjørsnøs wrote:

Hi,

While working on improvements to JarFile.getVersionedEntry, it occurred to
me that the missing lookups we are removing there may be just one example
of a more general issue.

To get a better understanding of how ZipFile.getEntry is used, I added some
PerfCounters for the number of hits and misses in calls to
ZipFile.Source.getEntryPos() and the elapsed run times for these hits and
misses.

To test this I used Spring PetClinic. I only care about startup time here,
since that's when most of the Zip lookups happens.

Here's what I found:

Hits 10776, misses: 744642
=> 1.4 percent of lookups are hits

I am a little worried adding extra overhead unconditionally into the JARreader that people may not need/want.

Maybe we should take a step back and see why there are so many misses?Is it because you have a long classpath with many JARs on it, and youend up searching a lot of JAR files unnecessarily?

If this is the case, I think converting the app to modules will give youthe speed up. A package can exist only in a single module so lookup isfast. You won't have misses unless you intentionally look for thingsthat don't exist.


Or, you can just package the app into one giant JAR file.


Thanks
- Ioi


Hits runtime: 7995508 ns, miss runtime: 328360900 ns
=> 2.4 percent of run time is spent on hits.

The startup time reported by Spring Petclinic was 6395 ms. 5 percent of
this startup time was spent on missed getEntry lookups.

If we can improve the performance of lookup misses, I think there is a good
potential for overall performance wins.

One idea I had about how this could be done was to use Bloom filters. Bloom
filters provide a fast, space-efficient, probabilistic data structure which
may be used to determine that an element is definitely not a member of a
set.

I made a sloppy Bloom filter implementation and updated
ZipFile.Source.getEntryPos to check this filter first and return -1 if the
name is definitely not in the jar.

This gave the following results for Spring Petclinic startup:

Hits run time: 10964825 ns, miss runtime: 233868876 ns

We see that while hits are now 1.4x slower, the total time spent on lookups
is only 73 percent compared to that of the baseline.

This translates to a Petclinic startup improvement of 91 ms or 1.4 percent
(assuming single-threaded class loading here).

I expect that this can be improved further with a more clever use of hash
functions. Specifically it would be great to skip earlier in case of a
miss, before the String to byte[] encoding happens. Also, I haven't
analysed the false positive rate in the bloom filter, so that's probably
open for some tuning.

I also expect lookup misses to be even more common on applications with
longer class paths and more complex class loader delegation.

Cheers,
Eirik.

Re: Improving ZipFile.getEntry performance using Bloom filters

Reply via email to