Hi, While working on improvements to JarFile.getVersionedEntry, it occurred to me that the missing lookups we are removing there may be just one example of a more general issue.
To get a better understanding of how ZipFile.getEntry is used, I added some PerfCounters for the number of hits and misses in calls to ZipFile.Source.getEntryPos() and the elapsed run times for these hits and misses. To test this I used Spring PetClinic. I only care about startup time here, since that's when most of the Zip lookups happens. Here's what I found: Hits 10776, misses: 744642 => 1.4 percent of lookups are hits Hits runtime: 7995508 ns, miss runtime: 328360900 ns => 2.4 percent of run time is spent on hits. The startup time reported by Spring Petclinic was 6395 ms. 5 percent of this startup time was spent on missed getEntry lookups. If we can improve the performance of lookup misses, I think there is a good potential for overall performance wins. One idea I had about how this could be done was to use Bloom filters. Bloom filters provide a fast, space-efficient, probabilistic data structure which may be used to determine that an element is definitely not a member of a set. I made a sloppy Bloom filter implementation and updated ZipFile.Source.getEntryPos to check this filter first and return -1 if the name is definitely not in the jar. This gave the following results for Spring Petclinic startup: Hits run time: 10964825 ns, miss runtime: 233868876 ns We see that while hits are now 1.4x slower, the total time spent on lookups is only 73 percent compared to that of the baseline. This translates to a Petclinic startup improvement of 91 ms or 1.4 percent (assuming single-threaded class loading here). I expect that this can be improved further with a more clever use of hash functions. Specifically it would be great to skip earlier in case of a miss, before the String to byte[] encoding happens. Also, I haven't analysed the false positive rate in the bloom filter, so that's probably open for some tuning. I also expect lookup misses to be even more common on applications with longer class paths and more complex class loader delegation. Cheers, Eirik.