XBean and scanning performance

David Blevins Sun, 15 Apr 2012 13:22:26 -0700

(decision and 4 choices at the bottom -- feedback requested)

I did some studying of the zip file format and determined that part of the 
reworked xbean-finder Archive API was plain wrong.


Using maps as an analogy, here is how we were effectively scanning zips (jars):

    "Style A"

    Map<String, InputStream> zip = new HashMap<String, InputStream>();
    for (String entryName : zip.keySet()) {
        InputStream inputStream = zip.get(entryName);
        // scan the stream
    }

While a zip file does have an index of sorts, called the central directory, 
it isn't nearly good enough to support this type of random access.  The actual 
reading is done in C code when a zip file is accessed this way, but it seems 
about as slow as starting at the beginning of the stream, reading ahead until 
the entry is hit, and only then reading for "real".  I doubt it's doing exactly 
that, since in C code you should be able to seek to the middle of a file, but 
let's put it this way: at the very minimum you are reading the central 
directory on each and every random access.

I've reworked the Archive API so that when you iterate over it, you iterate 
over actual entries.  Using a map again as an analogy, it looks like this now:

    "Style B"

    for (Map.Entry<String, InputStream> entry : zip.entrySet()) {
        String className = entry.getKey();
        InputStream inputStream = entry.getValue();
        // scan the stream
    }
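
To make the analogy concrete, here is a minimal sketch of the two access 
patterns using only java.util.zip.  It is not the xbean-finder code: the class 
name ZipScanSketch, both method names, and the choice to reopen the archive for 
every lookup in Style A are assumptions made for illustration, and "scanning" 
is reduced to counting bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Illustrative only; names and structure are hypothetical.
class ZipScanSketch {

    // Style A: resolve every entry by name, one random access per entry.
    public static Map<String, Long> scanByName(Path jar, List<String> entryNames)
            throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        for (String name : entryNames) {
            // Assumption: each lookup opens the archive independently,
            // so the central directory is consulted on every access.
            try (ZipFile zip = new ZipFile(jar.toFile())) {
                ZipEntry entry = zip.getEntry(name);
                if (entry == null) {
                    continue; // name not present in this archive
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    sizes.put(name, drain(in));
                }
            }
        }
        return sizes;
    }

    // Style B: a single pass over the entries in the order they are stored.
    public static Map<String, Long> scanInOrder(Path jar) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(jar))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Reading from the ZipInputStream stops at the end of the
                // current entry, so this consumes exactly one entry.
                sizes.put(entry.getName(), drain(zip));
            }
        }
        return sizes;
    }

    // Stand-in for "scan the stream": read to the end and count the bytes.
    private static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }
}
```

Both methods visit the same entries and return the same byte counts; the only 
difference is whether the archive is walked once in order or probed once per 
name.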


Using Atlassian Confluence as a driver to benchmark only the call to 'new 
AnnotationFinder(archive)', which is where our scanning happens, here are the 
results before (Style A) and after (Style B):


  StyleA: 8.89s - 9.02s
  StyleB: 3.33s - 3.52s

Now unfortunately, the 'link()' call, which resolves parent classes that are 
not in the scanned jars as well as meta-annotations, still needs the Style A 
random access.  These lookups don't go in "jar order"; they are definitely 
random access.  With the new and improved code that scans Confluence in around 
3.4s, here is the time with 'link()' added:

  StyleB scan + StyleA link: 15.61s - 15.75s

That link() call adds about another 12 seconds, roughly equivalent to the cost 
of 4 more scans.

So the good news is we don't need the link.  We very much like the link, but we 
don't need it for Java EE 6 certification.  We have two very excellent features 
that depend on that linking:

  - Meta-Annotations
  - JAX-RS discovery of non-annotated Application subclasses (Application is a 
    concrete class you subclass, like HttpServlet)

We have more or less 4 kinds of choices on how we deal with this:

  1. link() is always called.  (always slow, extra features always enabled)
  2. link() can be disabled but is enabled by default.  (slow, w/optional fast 
     flag, extra features enabled by default)
  3. link() can be enabled but is disabled by default.  (fast, w/optional slow 
     flag, extra features disabled by default)
  4. link() is never called.  (always fast, extra features permanently disabled)
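
For choices 2 and 3 the mechanism would be the same and only the default would 
differ.  A rough sketch of such a switch, assuming a plain property lookup: the 
property name "example.finder.link" and the LinkOptionSketch class are made up 
for this example and are not existing xbean or OpenEJB options.

```java
import java.util.Properties;

// Illustrative only; not an existing xbean-finder or OpenEJB class.
class LinkOptionSketch {

    // Hypothetical property name, invented for this example.
    static final String LINK_PROPERTY = "example.finder.link";

    // Choice 2 corresponds to defaultEnabled = true,
    // choice 3 corresponds to defaultEnabled = false.
    public static boolean shouldLink(Properties config, boolean defaultEnabled) {
        String value = config.getProperty(LINK_PROPERTY);
        if (value == null) {
            return defaultEnabled; // nothing set: fall back to the default
        }
        return Boolean.parseBoolean(value.trim());
    }
}
```

A caller would run the scan first and then invoke link() only when 
shouldLink(...) returns true, so the roughly 12 second cost is paid only by 
those who ask for the extra features (choice 3) or fail to opt out (choice 2).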


Thoughts, preferences?


-David
