More experiments.
I've created a domain model for storing test reports, and populated H2
with actual data to see how it behaves. The code is in [1].
It uses Hibernate via JPA. It defines several indices, too.
The test data is the actual JUnit report from a Jenkins trunk build.
It contains about 4000 tests in 1000 suites. In our current persisted
XML format it takes up 2MB.
I've loaded this into the database 1000 times. That is, the total data
stored is 4M test records + 1M suite records, simulating 1000 virtual
builds.
On this data set, the database occupies 2.3GB on disk, which is on par
with 1000x2MB. So that looks OK.
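For reference, the arithmetic behind those figures can be sketched as
follows. This is only a rough check: the class and method names are
made up for illustration, and it assumes 1GB = 10^9 bytes and that the
2.3GB on disk includes the indices.

```java
// Back-of-the-envelope check of the data set figures quoted above.
// All names here are hypothetical; they are not from the experiment code.
class DataSetEstimate {
    static final long BUILDS = 1_000;           // simulated builds
    static final long TESTS_PER_BUILD = 4_000;  // about 4000 tests per report
    static final long SUITES_PER_BUILD = 1_000; // about 1000 suites per report

    // Total rows stored: test records plus suite records.
    static long totalRows() {
        return BUILDS * (TESTS_PER_BUILD + SUITES_PER_BUILD);
    }

    // Average on-disk bytes per row for a given database size.
    static long bytesPerRow(long dbBytes) {
        return dbBytes / totalRows();
    }

    public static void main(String[] args) {
        long dbBytes = 2_300_000_000L;       // 2.3GB, assuming 1GB = 10^9 bytes
        long xmlBytes = BUILDS * 2_000_000L; // 1000 x 2MB of XML
        System.out.println("rows          = " + totalRows());
        System.out.println("bytes per row = " + bytesPerRow(dbBytes));
        System.out.println("DB / XML size = " + (double) dbBytes / xmlBytes);
    }
}
```

That works out to 5M rows at roughly 460 bytes per row including index
overhead, or about 1.15x the size of the equivalent XML.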
In memory, the database occupies about 13MB. Looking at the heap dump,
much of it appears to be a cache, which would presumably shrink well
(and yes, when I reopen this database, the footprint drops down to 2MB
before I run any query.)
The queries run fine. I searched all the test cases in
'hudson.EnvVarsTest', and it took 100ms to load all 1000 records correctly.
In comparison, when I ran a query that requires a full table scan
(after dropping the kernel page cache), it took 50 secs.
To me it looks good so far.
I'm going to start loading another 9000 builds before I head home to
expand the data set to simulate 10K builds, but given the performance
characteristics thus far, I expect it to work well, too.
[1]
https://github.com/jenkinsci/database-h2-plugin/tree/large-data-experiment
On 12/12/2012 10:56 AM, Kohsuke Kawaguchi wrote:
I've added HSQLDB and Derby to the test to do some more comparisons. The
code is in https://github.com/kohsuke/many-db
My findings follow:
First, we can safely eliminate Derby from the picture. It takes up about
4MB of heap per database (10K records of about 80 bytes/record), and
it's painfully slow to insert.
H2 and HSQLDB feel comparable speed-wise. Each database in H2 takes up
about 720KB according to YourKit (or 420KB according to NetBeans Insane
tool, which I suspect missed some other references from the thread?)
HSQLDB uses 1.4MB a pop according to YourKit (or 2MB according to Insane.)
So I think the choice of H2 was reasonable. I also liked that the code
ships with debug symbols and source jars, making it very easy to debug
and find out what's going on.
2012/12/8 Kohsuke Kawaguchi wrote:
Database servers often host a lot of databases, so I don't think
having 1000s of independent DBs is beyond the design boundary. With
proper cache management, I don't think we'll ever have all 1000s of
them open at the same time, and for those few that are open, a few GB
of heap isn't the end of the world.
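To illustrate what "proper cache management" could look like, here is
a minimal sketch of an LRU cache of open database handles built on
LinkedHashMap. The class name, the opener callback and the limit are
all assumptions made for this example; nothing like this exists in the
experiment code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: keeps at most maxOpen per-job databases open and
// closes the least recently used one when the limit is exceeded.
class OpenDatabaseCache<H extends AutoCloseable> {
    private final int maxOpen;
    private final Function<String, H> opener; // opens the DB for a job name
    private final LinkedHashMap<String, H> open;

    OpenDatabaseCache(int maxOpen, Function<String, H> opener) {
        this.maxOpen = maxOpen;
        this.opener = opener;
        // accessOrder=true makes iteration order least-recently-used first.
        this.open = new LinkedHashMap<String, H>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, H> eldest) {
                if (size() <= OpenDatabaseCache.this.maxOpen) {
                    return false;
                }
                try {
                    eldest.getValue().close(); // release heap and file handles
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                return true;
            }
        };
    }

    // Returns the open handle for the job, opening it on first use.
    synchronized H get(String jobName) {
        H handle = open.get(jobName);
        if (handle == null) {
            handle = opener.apply(jobName);
            open.put(jobName, handle);
        }
        return handle;
    }

    synchronized int openCount() {
        return open.size();
    }
}
```

A real implementation would also have to make sure that a handle is not
closed while a caller is still using it, for example by reference
counting; the sketch ignores that.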
That said, we can and should try a few more, like HSQLDB, to see if
they have different characteristics. It might also be possible to
make some quick improvements to H2 for some quick gains, since H2
probably isn't designed for 1000s of independent DBs in one JVM.
One benefit of a SQL DB is its popularity and the vast number of
people who are familiar with it, including users. For example, test
data in a DB would allow users and other devs to come up with their
own queries. It'll also make it easier for existing plugin devs to
use it.
MapDB is a nice library on its own, but Map is by definition single
index, so I'm not sure it works for many typical use cases. Say test
reports --- we need to be able to query all failing ones for a given
build as well as all the past executions of a specific test case.
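To make the single-index point concrete, here is a rough sketch of
what those two queries need when all you have is maps: one map per
access path, both kept in sync by hand on every insert. The class and
field names are hypothetical. In a SQL DB the same thing would be two
indices on one table, maintained by the database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: two hand-maintained indices over the same records.
class TestResultIndex {
    static final class Result {
        final int build;
        final String testName;
        final boolean failed;

        Result(int build, String testName, boolean failed) {
            this.build = build;
            this.testName = testName;
            this.failed = failed;
        }
    }

    // Access path 1: build number -> failing results in that build.
    private final Map<Integer, List<Result>> failuresByBuild = new HashMap<>();
    // Access path 2: test case name -> all of its past executions.
    private final Map<String, List<Result>> historyByTest = new HashMap<>();

    // Every insert has to update both maps, or the indices drift apart.
    void add(Result r) {
        historyByTest.computeIfAbsent(r.testName, k -> new ArrayList<>()).add(r);
        if (r.failed) {
            failuresByBuild.computeIfAbsent(r.build, k -> new ArrayList<>()).add(r);
        }
    }

    List<Result> failuresIn(int build) {
        return failuresByBuild.getOrDefault(build, Collections.<Result>emptyList());
    }

    List<Result> historyOf(String testName) {
        return historyByTest.getOrDefault(testName, Collections.<Result>emptyList());
    }
}
```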
2012/12/5 Jesse Glick wrote:
On 12/05/2012 06:37 AM, Kohsuke Kawaguchi wrote:
H2 database, when opened, takes up about 1MB in heap.
Seems excessive when typical jobs will have much less data than
this that needs to be stored.
I am still not convinced that using a SQL database for this kind
of thing is appropriate.
1. Portability of SQL is a bit of a red herring because once you
start using, say, H2 to store per-job data, you cannot casually
switch to another DB without losing historical build records;
and Jenkins would have to ship with _some_ DB plugin, or all
plugins using the DB API would be broken. And for per-job DBs we
are narrowing the field to those that are embeddable, which
probably means just H2 in practice.
2. SQL databases are generally optimized for one slow-to-start
instance, a small number of expensive connections, and maybe
dozens of tables with lots of data. Whereas we need thousands of
extremely cheap instances, each with one immediately available
connection and a few tables with usually not so much data. The
closer we can get to java.io.RandomAccessFile.<init> the better.
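As a baseline for that comparison, the cost of a bare
RandomAccessFile open can be measured with something like the sketch
below. The class and method names are made up for illustration, it
opens an empty temp file rather than a real job database, and a single
unwarmed sample like this is only a rough indication.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical sketch: times a plain file open as the lower bound that
// an embedded per-job store would be compared against.
class OpenCost {
    // Opens the file read-only and returns the elapsed time in nanoseconds.
    static long timeOpenNanos(File f) {
        long start = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            return System.nanoTime() - start;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates an empty temp file standing in for a job's DB and times one open.
    static long sampleNanos() {
        try {
            File f = File.createTempFile("job", ".db");
            f.deleteOnExit();
            return timeOpenNanos(f);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("open took " + sampleNanos() + " ns");
    }
}
```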
Is there any fully free (so not BDB-JE) DB which is pure Java,
embeddable, supports some kind of indices, supports compact
binary schemas (so not e.g. Lucene or the current wave of JSON
DBs), and has a very simple client API once the connection is
set up; while being openable from one or two disk files in say
under a millisecond with no significant penalty beyond the file
descriptor? MapDB [1] looks most promising so far.
[1] https://github.com/jankotek/mapdb
--
Kohsuke Kawaguchi
--
Kohsuke Kawaguchi | CloudBees, Inc. | http://cloudbees.com/
Try Nectar, our professional version of Jenkins