Hi guys, this week I worked on some improvements for the insertion operations (add, delete, modify, etc.), because since the last version the performance has been really bad (well, way slower than it was 6 months ago).
The reason we have worse performance is simple: in order to guarantee that the database is stable even if the server crashes, I have activated the transaction support. It's great from the data access POV (even after a crash, the base is restored at startup and we can access the data, which was not the case before). The drawback is that inserting data is now 10 times slower: with M8, I was able to inject 600 entries per second, and I'm now leveling out at 50 per second (it also depends on the number of indexes). So yesterday I modified the code to improve the performance.

There are two things to consider when injecting data into a BTree with transactions enabled:
- the number of entries stored in memory before being committed into a journal
- the number of entries stored in the journal before being flushed to the database

Obviously, the second number must be higher than the first one (it makes no sense to flush a journal which has not been updated yet).

I conducted some basic tests on JDBM, with various values. Here are some of the results:

Old JDBM, no TM :
-----------------
13130ms transaction disabled
3770ms transaction disabled
3768ms transaction disabled
3756ms transaction disabled
3678ms transaction disabled
3678ms transaction disabled
3667ms transaction disabled
3687ms transaction disabled
3732ms transaction disabled
3721ms transaction disabled

average 3731ms

JDBM2, no TM :
--------------
12339ms transaction disabled
4087ms transaction disabled
4081ms transaction disabled
4050ms transaction disabled
4015ms transaction disabled
4003ms transaction disabled
4003ms transaction disabled
3973ms transaction disabled
4003ms transaction disabled
3986ms transaction disabled

average 4022ms (8% slower)

Old JDBM, with a transaction manager (TM) :
-------------------------------------------
Here, CI = commits and Sync = flush on disk (the numbers below are [CI, Sync]).
18130ms with CI, Sync [1,1]
14448ms with CI, Sync [1,2]
12934ms with CI, Sync [1,5]
12264ms with CI, Sync [1,10]
11845ms with CI, Sync [1,20]
11481ms with CI, Sync [1,50]
11499ms with CI, Sync [1,100]
11423ms with CI, Sync [1,200]
11487ms with CI, Sync [1,500]
11481ms with CI, Sync [1,1000]
11534ms with CI, Sync [1,2000]
11471ms with CI, Sync [1,5000]
11550ms with CI, Sync [1,10000]
11300ms with CI, Sync [2,2]
9923ms with CI, Sync [2,5]
9177ms with CI, Sync [2,10]
8672ms with CI, Sync [2,20]
8454ms with CI, Sync [2,50]
1922030ms with CI, Sync [2,100]
8530ms with CI, Sync [2,200]
8324ms with CI, Sync [2,500]
8363ms with CI, Sync [2,1000]
8381ms with CI, Sync [2,2000]
8417ms with CI, Sync [2,5000]
8361ms with CI, Sync [2,10000]
7772ms with CI, Sync [5,5]
7191ms with CI, Sync [5,10]
6737ms with CI, Sync [5,20]
6503ms with CI, Sync [5,50]
6398ms with CI, Sync [5,100]
6398ms with CI, Sync [5,200]
6448ms with CI, Sync [5,500]
6428ms with CI, Sync [5,1000]
6371ms with CI, Sync [5,2000]
6385ms with CI, Sync [5,5000]
6422ms with CI, Sync [5,10000]
6348ms with CI, Sync [10,10]
5844ms with CI, Sync [10,20]
5565ms with CI, Sync [10,50]
5473ms with CI, Sync [10,100]
5471ms with CI, Sync [10,200]
5453ms with CI, Sync [10,500]
5491ms with CI, Sync [10,1000]
5517ms with CI, Sync [10,2000]
5950ms with CI, Sync [10,5000]
5806ms with CI, Sync [10,10000]
5054ms with CI, Sync [20,20]
4769ms with CI, Sync [20,50]
4657ms with CI, Sync [20,100]
4663ms with CI, Sync [20,200]
4650ms with CI, Sync [20,500]
4622ms with CI, Sync [20,1000]
4664ms with CI, Sync [20,2000]
4634ms with CI, Sync [20,5000]
4662ms with CI, Sync [20,10000]
4243ms with CI, Sync [50,50]
4134ms with CI, Sync [50,100]
4123ms with CI, Sync [50,200]
4107ms with CI, Sync [50,500]
4162ms with CI, Sync [50,1000]
4115ms with CI, Sync [50,2000]
4105ms with CI, Sync [50,5000]
4143ms with CI, Sync [50,10000]
3982ms with CI, Sync [100,100]
3926ms with CI, Sync [100,200]
3913ms with CI, Sync [100,500]
3948ms with CI, Sync [100,1000]
3942ms with CI, Sync [100,2000]
3934ms with CI, Sync [100,5000]
3936ms with CI, Sync [100,10000]
3928ms with CI, Sync [200,200]
3909ms with CI, Sync [200,500]
4116ms with CI, Sync [200,1000]
4016ms with CI, Sync [200,2000]
3919ms with CI, Sync [200,5000]
3892ms with CI, Sync [200,10000]
3797ms with CI, Sync [500,500]
3798ms with CI, Sync [500,1000]
3814ms with CI, Sync [500,2000]
3790ms with CI, Sync [500,5000]
3818ms with CI, Sync [500,10000]
3725ms with CI, Sync [1000,1000]
3786ms with CI, Sync [1000,2000]
3736ms with CI, Sync [1000,5000]
3769ms with CI, Sync [1000,10000]
3682ms with CI, Sync [2000,2000]
3678ms with CI, Sync [2000,5000]
3691ms with CI, Sync [2000,10000]
3459ms with CI, Sync [5000,5000]
3493ms with CI, Sync [5000,10000]
3277ms with CI, Sync [10000,10000]

As we can see, the more we defer the writes, the better the performance. So I adjusted the parameters on the server, and I currently do a commit every 2000 changes and a flush every 4000 changes. It's hard coded atm, but I will make it configurable. The performance went up from 50 adds per second to 185 adds per second. Good, but nothing to get excited about compared to the 600 adds/s we had :/ With such numbers, it will take roughly 1h30 to inject 1M entries...

What can we do to improve the performance then? Atm, with JDBM, I see no bright future. There are ways to squeeze out a bit of performance, but not a lot: we can bypass the full chain and inject the data directly into the btrees, but the gain will be marginal.
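Just to make those two numbers concrete, here is roughly what the batching logic looks like once it's configurable. This is a simplified sketch only: the interval names and the flushJournal() call are illustrative and not the real JDBM API, the only real call is RecordManager.commit().

import java.io.IOException;

import jdbm.RecordManager;

/**
 * Sketch of the configurable write batching
 * (commitInterval / syncInterval are the two numbers discussed above)
 */
public class BatchedWrites
{
    private final RecordManager recMan;

    /** Changes kept in memory before being committed into the journal (currently 2000) */
    private final int commitInterval;

    /** Changes kept in the journal before being flushed to the database (currently 4000) */
    private final int syncInterval;

    private long changeCount = 0;

    public BatchedWrites( RecordManager recMan, int commitInterval, int syncInterval )
    {
        this.recMan = recMan;
        this.commitInterval = commitInterval;
        this.syncInterval = syncInterval;
    }

    /** To be called after each add/delete/modify applied to the btrees */
    public void onChange() throws IOException
    {
        changeCount++;

        if ( changeCount % commitInterval == 0 )
        {
            // Push the in-memory modifications into the transaction journal
            recMan.commit();
        }

        if ( changeCount % syncInterval == 0 )
        {
            // Flush the journal to the database file (hypothetical call:
            // in JDBM this is handled by the transaction manager)
            flushJournal();
        }
    }

    private void flushJournal() throws IOException
    {
        // placeholder, not the real JDBM API
    }
}

With commitInterval = 2000 and syncInterval = 4000, this is exactly the configuration the server is currently running with.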
So here are a few possibilities:

1) Add a raw data injection feature in JDBM

The idea is to order the data before injecting them, and apply an algorithm that builds the pages directly, without having to go through the tree rebalancing. If we can have N elements in each page, we order the data and store N elements in each leaf page, keeping the largest element of each page in a separate list. Then we reproduce the same processing on the gathered list, but for parent pages, linking them to the leaves, and so on up to the root. This is the fastest way to store data in a btree (see the sketch at the end of this mail).

This requires a few operations:
- order the data on disk: an O(N x log(N)) operation
- modify JDBM to allow us to inject data into a page directly

I evaluate this to two or three weeks of work. The gain will be quite big, the slowest operation being the ordering of the data. However, we are still stuck with JDBM's lack of a global transaction system, which forces us to lock every operation (including searches) while we add entries.

2) Switch to JDBM 4

We have to evaluate what JDBM 4 brings to us. It's obviously a better base, but we don't know to what extent, and we don't know if:
- we can have global cross-btree transactions
- we can inject raw data into the files without going through the btree operations

Evaluating JDBM4 is probably a week of work.

3) Switch to Mavibot

Mavibot is a labs project I started last year, which implements an MVCC BTree in memory. It's working, it's fast, but it's in-memory only atm (although the data are regularly flushed to disk). We need to improve it by adding these features:
- back it with a physical layer that accesses data on disk, to avoid storing everything in memory
- add a raw load feature
- add a global transaction system, across trees

The last feature is probably the easiest to implement, as we already have a version associated with each new modification: if we retain the version we started with, we can gather all the btrees' revisions and validate them all in one shot, or roll them all back in one shot (a revision is visible only once it has been validated). One very interesting aspect of Mavibot is that you can update the btrees concurrently with searches: searches are done on older revisions, which don't interact with the pages being added. I think there is one to two more months of work on Mavibot to get it to the level we need.

Proposal
--------
IMO, we should investigate JDBM4 asap. It might solve many of the issues we have, at a low price. Improving the existing JDBM is just a waste of our precious time.

In the long term, I would favor Mavibot for a couple of reasons:
- first, it's an MVCC base, so we can run searches and modifications concurrently (although we can't do many modifications at the same time)
- second, we have a complete understanding of the code: we wrote it, so it's easier to understand the impact of the modifications we make to it
- third, we can have a dedicated backend based on the current Mavibot code, where all the data stay in memory (that would work with a few tens of thousands of entries, and the performance would be blazing fast)

I'd like to know your opinion.

Thanks !
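PS: to make option 1 a bit more concrete, here is a rough sketch of the bottom-up bulk load. It's illustrative Java only, nothing JDBM-specific: the Page class, the long keys and the bulkLoad() method are just there to show the shape of the algorithm (it assumes at least one key and a page size >= 2).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Sketch: bottom-up bulk load of a btree from sorted data
 */
public class BulkLoadSketch
{
    /** A page simply holds up to pageSize keys plus pointers to its children */
    static class Page
    {
        List<Long> keys = new ArrayList<Long>();
        List<Page> children = new ArrayList<Page>();
    }

    public static Page bulkLoad( List<Long> keys, int pageSize )
    {
        // 1) order the data: the O(N x log(N)) part
        Collections.sort( keys );

        // 2) build the leaves: store pageSize elements per page
        List<Page> level = new ArrayList<Page>();

        for ( int i = 0; i < keys.size(); i += pageSize )
        {
            Page leaf = new Page();
            leaf.keys.addAll( keys.subList( i, Math.min( i + pageSize, keys.size() ) ) );
            level.add( leaf );
        }

        // 3) reproduce the same processing on the gathered largest keys,
        //    level by level, until a single root page remains
        while ( level.size() > 1 )
        {
            List<Page> parents = new ArrayList<Page>();

            for ( int i = 0; i < level.size(); i += pageSize )
            {
                Page parent = new Page();

                for ( Page child : level.subList( i, Math.min( i + pageSize, level.size() ) ) )
                {
                    // the separator key for a child is its largest key
                    parent.keys.add( child.keys.get( child.keys.size() - 1 ) );
                    parent.children.add( child );
                }

                parents.add( parent );
            }

            level = parents;
        }

        return level.get( 0 );
    }
}

The real work is obviously in writing those pages directly into the JDBM (or Mavibot) file format, but the tree construction itself is no more complicated than this.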
--
Regards, Cordialement,
Emmanuel Lécharny
www.iktek.com