More experiments.

I've created a domain model for storing test reports, and populated H2 with actual data to see how it behaves. The code is in [1].

It uses Hibernate via JPA. It defines several indices, too.

The test data is the actual JUnit report from a Jenkins trunk build. It contains about 4000 tests in 1000 suites. In our current persisted XML format it takes up 2MB.

I've loaded this into the database 1000 times. That is, the total data stored is 4M test records + 1M suite records, simulating 1000 virtual builds.

On this data set, the database occupies 2.3GB on disk, which is on par with 1000x2MB. So that looks OK.
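
As a rough sanity check of the sizes quoted above, the arithmetic can be sketched in a few lines of Java. The figures are the ones from this message; the class name is made up for illustration and decimal units (1GB = 1000MB) are an assumption.

```java
// Back-of-the-envelope check of the data set size described above.
// Figures come from the message; decimal units (1GB = 1000MB) are assumed.
class DataSetEstimate {
    static final long BUILDS = 1_000;
    static final long TESTS_PER_BUILD = 4_000;
    static final long SUITES_PER_BUILD = 1_000;
    static final double XML_MB_PER_BUILD = 2.0;
    static final double DB_SIZE_MB = 2_300.0;

    /** Test records plus suite records across all simulated builds. */
    static long totalRecords() {
        return BUILDS * (TESTS_PER_BUILD + SUITES_PER_BUILD);
    }

    /** Size the same data would take in the persisted XML format. */
    static double xmlTotalMb() {
        return BUILDS * XML_MB_PER_BUILD;
    }

    /** How much larger the database is than the equivalent XML. */
    static double dbToXmlRatio() {
        return DB_SIZE_MB / xmlTotalMb();
    }

    /** Average on-disk bytes per record, indices included. */
    static double bytesPerRecord() {
        return DB_SIZE_MB * 1_000_000 / totalRecords();
    }

    public static void main(String[] args) {
        System.out.printf("records: %d%n", totalRecords());
        System.out.printf("XML equivalent: %.0f MB%n", xmlTotalMb());
        System.out.printf("DB / XML ratio: %.2f%n", dbToXmlRatio());
        System.out.printf("bytes per record: %.0f%n", bytesPerRecord());
    }
}
```

Under these assumptions the database is roughly 15% larger than the XML it replaces, averaged over 5M records.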

In memory, the database occupies about 13MB. Looking at the heap dump, much of it appears to be a cache, which would presumably shrink well (and yes, when I reopen this database, the footprint drops to 2MB before I run any query).

The queries run fine. I searched all the test cases in 'hudson.EnvVarsTest', and it took 100ms to load all 1000 records correctly.

In comparison, when I ran a query that requires a full table scan (after dropping the kernel page cache), it took 50 secs.

To me it looks good so far.


I'm going to start loading another 9000 builds before I head home to expand the data set to simulate 10K builds, but given the performance characteristics thus far, I expect it to work well, too.


[1] https://github.com/jenkinsci/database-h2-plugin/tree/large-data-experiment

On 12/12/2012 10:56 AM, Kohsuke Kawaguchi wrote:
I've added HSQLDB and Derby to the test to do some more comparisons. The
code is in https://github.com/kohsuke/many-db

My findings follow:

First, we can safely eliminate Derby from the picture. It takes up about
4MB of heap per database (10K records of about 80 bytes each), and it's
painfully slow to insert.

H2 and HSQLDB feel comparable speed-wise. Each database in H2 takes up
about 720KB according to YourKit (or 420KB according to NetBeans Insane
tool, which I suspect missed some other references from the thread?)
HSQLDB uses 1.4MB a pop according to YourKit (or 2MB according to Insane.)

So I think the choice of H2 was reasonable. I also liked that the code
ships with debug symbols and source jars, making it very easy to debug
and find out what's going on.


2012/12/8 Kohsuke Kawaguchi <[email protected] <mailto:[email protected]>>

    Database servers often host a lot of databases, so I don't
    think having 1000s of independent DBs is beyond the design
    boundary. With proper cache management, I don't think we'll ever
    have all 1000s of them open at the same time, and for those few that
    are open, a few GB of heap isn't the end of the world.
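
The "proper cache management" idea above could look something like the following sketch: keep at most N database handles open and close the least recently used one on overflow. This is only an illustration using java.util.LinkedHashMap; OpenDbCache, the opener function, and the job names are hypothetical and not part of Jenkins or H2.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Keeps at most maxOpen handles open; closes the least recently used on overflow. */
class OpenDbCache<H extends AutoCloseable> {
    private final Function<String, H> opener;
    private final Map<String, H> open;

    OpenDbCache(int maxOpen, Function<String, H> opener) {
        this.opener = opener;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.open = new LinkedHashMap<String, H>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, H> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                closeQuietly(eldest.getValue());
                return true; // drop the evicted handle from the map
            }
        };
    }

    /** Returns the open handle for a job, opening it first if necessary. */
    synchronized H get(String jobName) {
        H handle = open.get(jobName); // also marks the entry as recently used
        if (handle == null) {
            handle = opener.apply(jobName);
            open.put(jobName, handle); // may evict and close the eldest entry
        }
        return handle;
    }

    synchronized int size() {
        return open.size();
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception e) {
            // A real implementation would log this; ignored in the sketch.
        }
    }

    public static void main(String[] args) {
        List<String> closed = new ArrayList<>();
        // The "handle" here just records that it was closed.
        OpenDbCache<AutoCloseable> cache =
                new OpenDbCache<AutoCloseable>(2, name -> () -> closed.add(name));
        cache.get("job-a");
        cache.get("job-b");
        cache.get("job-a"); // job-b is now the least recently used
        cache.get("job-c"); // exceeds the limit, so job-b is closed
        System.out.println("closed: " + closed);
    }
}
```

A handle that is still in use when it gets evicted would need reference counting or similar, which this sketch leaves out.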

    That said, we can and should try a few more like HSQLDB to see if
    they have different characteristics. It might also be possible to
    make some quick improvements to H2 for some quick gains, for H2
    probably isn't designed for 1000s of independent DBs in one JVM.

    One benefit of a SQL DB is its popularity and the vast number of people
    who are familiar with it, including users. For example, test data in
    a DB would allow users and other devs to come up with queries. It'll
    also make it easier for existing plugin devs to use them.

    MapDB is a nice library on its own, but Map is by definition single
    index, so I'm not sure it works for many typical use cases. Say test
    reports --- we need to be able to query all failing ones for a given
    build as well as all the past executions of a specific test case.
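
To illustrate the single-index point: with plain maps, each of the two query paths needs its own map, and both have to be kept in sync by hand on every write. The sketch below uses only java.util, not MapDB's API, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Test results reachable by two access paths, each backed by its own map. */
class TestResultStore {
    static final class Result {
        final int build;
        final String testName;
        final boolean failed;

        Result(int build, String testName, boolean failed) {
            this.build = build;
            this.testName = testName;
            this.failed = failed;
        }
    }

    // "Index" 1: failing results, keyed by build number.
    private final Map<Integer, List<Result>> failuresByBuild = new HashMap<>();
    // "Index" 2: every execution, keyed by test name.
    private final Map<String, List<Result>> historyByTest = new HashMap<>();

    /** Every write has to update both maps, or the two views drift apart. */
    void add(Result r) {
        historyByTest.computeIfAbsent(r.testName, k -> new ArrayList<>()).add(r);
        if (r.failed) {
            failuresByBuild.computeIfAbsent(r.build, k -> new ArrayList<>()).add(r);
        }
    }

    List<Result> failuresIn(int build) {
        return failuresByBuild.getOrDefault(build, Collections.emptyList());
    }

    List<Result> historyOf(String testName) {
        return historyByTest.getOrDefault(testName, Collections.emptyList());
    }

    public static void main(String[] args) {
        TestResultStore store = new TestResultStore();
        store.add(new Result(1, "hudson.EnvVarsTest", false));
        store.add(new Result(2, "hudson.EnvVarsTest", true));
        System.out.println("failures in build 2: " + store.failuresIn(2).size());
        System.out.println("executions of hudson.EnvVarsTest: "
                + store.historyOf("hudson.EnvVarsTest").size());
    }
}
```

In a SQL database the same two queries would be served by two indices on one table, with the engine keeping them consistent.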




    2012/12/5 Jesse Glick <[email protected]
    <mailto:[email protected]>>

        On 12/05/2012 06:37 AM, Kohsuke Kawaguchi wrote:

            H2 database, when opened, takes up about 1MB in heap.


        Seems excessive when typical jobs will have much less data than
        this that needs to be stored.

        I am still not convinced that using a SQL database for this kind
        of thing is appropriate.

        1. Portability of SQL is a bit of a red herring because once you
        start using, say, H2 to store per-job data, you cannot casually
        switch to another DB without losing historical build records;
        and Jenkins would have to ship with _some_ DB plugin, or all
        plugins using the DB API would be broken. And for per-job DBs we
        are narrowing the field to those that are embeddable, which
        probably means just H2 in practice.

        2. SQL databases are generally optimized for one slow-to-start
        instance, a small number of expensive connections, and maybe
        dozens of tables with lots of data. Whereas we need thousands of
        extremely cheap instances, each with one immediately available
        connection and a few tables with usually not so much data. The
        closer we can get to java.io.RandomAccessFile.<init> the better.
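
For a sense of the baseline being referred to, the cost of the RandomAccessFile constructor on a small file can be measured with a sketch like the one below. OpenCost and its helper methods are hypothetical names, and a single timing like this is only indicative, not a benchmark.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Measures how long it takes to open a file with RandomAccessFile. */
class OpenCost {
    /** Creates an empty temporary file that is deleted when the JVM exits. */
    static File newTempFile() {
        try {
            File f = File.createTempFile("open-cost", ".bin");
            f.deleteOnExit();
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the nanoseconds spent in the RandomAccessFile constructor. */
    static long timeOpenNanos(File f) {
        long start = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            // Stop the clock as soon as the file is open; closing is not counted.
            return System.nanoTime() - start;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File f = newTempFile();
        timeOpenNanos(f); // first call warms up class loading
        System.out.println("open took " + timeOpenNanos(f) + " ns");
    }
}
```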

        Is there any fully free (so not BDB-JE) DB which is pure Java,
        embeddable, supports some kind of indices, supports compact
        binary schemas (so not e.g. Lucene or the current wave of JSON
        DBs), and has a very simple client API once the connection is
        set up; while being openable from one or two disk files in say
        under a millisecond with no significant penalty beyond the file
        descriptor? MapDB [1] looks most promising so far.

        [1] https://github.com/jankotek/mapdb




    --
    Kohsuke Kawaguchi




--
Kohsuke Kawaguchi


--
Kohsuke Kawaguchi | CloudBees, Inc. | http://cloudbees.com/
Try Nectar, our professional version of Jenkins
