Hi guys, this week I worked on some improvements for the insertion operations (add, delete, modify, etc.), because since the last version the performance has been really bad (well, way slower than it was 6 months ago).
The reason we have worse performance is simple: in order to guarantee that the database is stable even if the server crashes, I have activated the transaction support. It's great from the data access POV (even after a crash, the base is restored at startup and we can access the data, which was not the case before). The drawback is that inserting data is now 10 times slower: with M8, I was able to inject 600 entries per second, and I'm now leveling out at 50 per second (it also depends on the number of indexes). So yesterday I modified the code to improve the performance.

There are two things to consider when injecting data into a BTree with transactions enabled:
- the number of entries stored in memory before being committed into a journal
- the number of entries stored in the journal before being flushed to the database

Obviously, the second number must be higher than the first one (it makes no sense to flush a journal which has not been updated yet).

I conducted some basic tests on JDBM, with various values. Here are some of the results:

Old JDBM, no TM :
-----------------
13130ms transaction disabled
3770ms transaction disabled
3768ms transaction disabled
3756ms transaction disabled
3678ms transaction disabled
3678ms transaction disabled
3667ms transaction disabled
3687ms transaction disabled
3732ms transaction disabled
3721ms transaction disabled

average 3731ms

JDBM2, no TM :
--------------
12339ms transaction disabled
4087ms transaction disabled
4081ms transaction disabled
4050ms transaction disabled
4015ms transaction disabled
4003ms transaction disabled
4003ms transaction disabled
3973ms transaction disabled
4003ms transaction disabled
3986ms transaction disabled

average 4022ms (8% slower)

Old JDBM, with a transaction manager (TM) :
-------------------------------------------
Here, CI = commits and Sync = flush on disk (the numbers below are [CI, Sync]).
18130ms with CI, Sync [1,1]
14448ms with CI, Sync [1,2]
12934ms with CI, Sync [1,5]
12264ms with CI, Sync [1,10]
11845ms with CI, Sync [1,20]
11481ms with CI, Sync [1,50]
11499ms with CI, Sync [1,100]
11423ms with CI, Sync [1,200]
11487ms with CI, Sync [1,500]
11481ms with CI, Sync [1,1000]
11534ms with CI, Sync [1,2000]
11471ms with CI, Sync [1,5000]
11550ms with CI, Sync [1,10000]
11300ms with CI, Sync [2,2]
9923ms with CI, Sync [2,5]
9177ms with CI, Sync [2,10]
8672ms with CI, Sync [2,20]
8454ms with CI, Sync [2,50]
1922030ms with CI, Sync [2,100]
8530ms with CI, Sync [2,200]
8324ms with CI, Sync [2,500]
8363ms with CI, Sync [2,1000]
8381ms with CI, Sync [2,2000]
8417ms with CI, Sync [2,5000]
8361ms with CI, Sync [2,10000]
7772ms with CI, Sync [5,5]
7191ms with CI, Sync [5,10]
6737ms with CI, Sync [5,20]
6503ms with CI, Sync [5,50]
6398ms with CI, Sync [5,100]
6398ms with CI, Sync [5,200]
6448ms with CI, Sync [5,500]
6428ms with CI, Sync [5,1000]
6371ms with CI, Sync [5,2000]
6385ms with CI, Sync [5,5000]
6422ms with CI, Sync [5,10000]
6348ms with CI, Sync [10,10]
5844ms with CI, Sync [10,20]
5565ms with CI, Sync [10,50]
5473ms with CI, Sync [10,100]
5471ms with CI, Sync [10,200]
5453ms with CI, Sync [10,500]
5491ms with CI, Sync [10,1000]
5517ms with CI, Sync [10,2000]
5950ms with CI, Sync [10,5000]
5806ms with CI, Sync [10,10000]
5054ms with CI, Sync [20,20]
4769ms with CI, Sync [20,50]
4657ms with CI, Sync [20,100]
4663ms with CI, Sync [20,200]
4650ms with CI, Sync [20,500]
4622ms with CI, Sync [20,1000]
4664ms with CI, Sync [20,2000]
4634ms with CI, Sync [20,5000]
4662ms with CI, Sync [20,10000]
4243ms with CI, Sync [50,50]
4134ms with CI, Sync [50,100]
4123ms with CI, Sync [50,200]
4107ms with CI, Sync [50,500]
4162ms with CI, Sync [50,1000]
4115ms with CI, Sync [50,2000]
4105ms with CI, Sync [50,5000]
4143ms with CI, Sync [50,10000]
3982ms with CI, Sync [100,100]
3926ms with CI, Sync [100,200]
3913ms with CI, Sync [100,500]
3948ms with CI, Sync [100,1000]
3942ms with CI, Sync [100,2000]
3934ms with CI, Sync [100,5000]
3936ms with CI, Sync [100,10000]
3928ms with CI, Sync [200,200]
3909ms with CI, Sync [200,500]
4116ms with CI, Sync [200,1000]
4016ms with CI, Sync [200,2000]
3919ms with CI, Sync [200,5000]
3892ms with CI, Sync [200,10000]
3797ms with CI, Sync [500,500]
3798ms with CI, Sync [500,1000]
3814ms with CI, Sync [500,2000]
3790ms with CI, Sync [500,5000]
3818ms with CI, Sync [500,10000]
3725ms with CI, Sync [1000,1000]
3786ms with CI, Sync [1000,2000]
3736ms with CI, Sync [1000,5000]
3769ms with CI, Sync [1000,10000]
3682ms with CI, Sync [2000,2000]
3678ms with CI, Sync [2000,5000]
3691ms with CI, Sync [2000,10000]
3459ms with CI, Sync [5000,5000]
3493ms with CI, Sync [5000,10000]
3277ms with CI, Sync [10000,10000]

As we can see, the more we defer the writes, the better the performance. So I adjusted the parameters on the server, and I currently do a commit every 2000 changes and a flush every 4000 changes. It's hard coded atm, but I will make it configurable. The performance went up from 50 adds per second to 185 adds per second. Good, but nothing to get excited about compared to the 600 adds/s we had :/ With such numbers, it will take roughly 1h30 to inject 1M entries...

What can we do to improve the performance then? Atm, with JDBM, I see no bright future. There are ways to squeeze out a bit of performance, but not a lot: we can bypass the full chain and inject the data directly into the btrees, but the gain will be marginal.
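Just to make those two numbers concrete, here is roughly what the batching logic looks like once it's configurable. This is a simplified sketch only: the interval names and the flushJournal() call are illustrative and not the real JDBM API, the only real call is RecordManager.commit().

import java.io.IOException;

import jdbm.RecordManager;

/**
 * Sketch of the configurable write batching
 * (commitInterval / syncInterval are the two numbers discussed above)
 */
public class BatchedWrites
{
    private final RecordManager recMan;

    /** Changes kept in memory before being committed into the journal (currently 2000) */
    private final int commitInterval;

    /** Changes kept in the journal before being flushed to the database (currently 4000) */
    private final int syncInterval;

    private long changeCount = 0;

    public BatchedWrites( RecordManager recMan, int commitInterval, int syncInterval )
    {
        this.recMan = recMan;
        this.commitInterval = commitInterval;
        this.syncInterval = syncInterval;
    }

    /** To be called after each add/delete/modify applied to the btrees */
    public void onChange() throws IOException
    {
        changeCount++;

        if ( changeCount % commitInterval == 0 )
        {
            // Push the in-memory modifications into the transaction journal
            recMan.commit();
        }

        if ( changeCount % syncInterval == 0 )
        {
            // Flush the journal to the database file (hypothetical call:
            // in JDBM this is handled by the transaction manager)
            flushJournal();
        }
    }

    private void flushJournal() throws IOException
    {
        // placeholder, not the real JDBM API
    }
}

With commitInterval = 2000 and syncInterval = 4000, this is exactly the configuration the server is currently running with.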
So here are a few possibilities:

1) Add a raw data injection feature in JDBM

The idea is to order the data before injecting them, and apply an algorithm that builds the pages directly, without having to go through the tree rebalancing. If we can have N elements in each page, we order the data and store N elements in each leaf page, keeping the largest element of each page in a separate list. Then we reproduce the same processing on the gathered list, but for parent pages, linking them to the leaves, and so on up to the root. This is the fastest way to store data in a btree (see the sketch at the end of this mail).

This requires a few operations:
- order the data on disk: an O(N x log(N)) operation
- modify JDBM to allow us to inject data into a page directly

I evaluate this to two or three weeks of work. The gain will be quite big, the slowest operation being the ordering of the data. However, we are still stuck with JDBM's lack of a global transaction system, which forces us to lock every operation (including searches) while we add entries.

2) Switch to JDBM 4

We have to evaluate what JDBM 4 brings to us. It's obviously a better base, but we don't know to what extent, and we don't know if:
- we can have global cross-btree transactions
- we can inject raw data into the files without going through the btree operations

Evaluating JDBM4 is probably a week of work.

3) Switch to Mavibot

Mavibot is a labs project I started last year, which implements an MVCC BTree in memory. It's working, it's fast, but it's in-memory only atm (although the data are regularly flushed to disk). We need to improve it by adding these features:
- back it with a physical layer that accesses data on disk, to avoid storing everything in memory
- add a raw load feature
- add a global transaction system, across trees

The last feature is probably the easiest to implement, as we already have a version associated with each new modification: if we retain the version we started with, we can gather all the btrees' revisions and validate them all in one shot, or roll them all back in one shot (a revision is visible only once it has been validated). One very interesting aspect of Mavibot is that you can update the btrees concurrently with searches: searches are done on older revisions, which don't interact with the pages being added. I think there is one to two more months of work on Mavibot to get it to the level we need.

Proposal
--------
IMO, we should investigate JDBM4 asap. It might solve many of the issues we have, at a low price. Improving the existing JDBM is just a waste of our precious time.

In the long term, I would favor Mavibot for a couple of reasons:
- first, it's an MVCC base, so we can run searches and modifications concurrently (although we can't do many modifications at the same time)
- second, we have a complete understanding of the code: we wrote it, so it's easier to understand the impact of the modifications we make to it
- third, we can have a dedicated backend based on the current Mavibot code, where all the data stay in memory (that would work with a few tens of thousands of entries, and the performance would be blazing fast)

I'd like to know your opinion.

Thanks !
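PS: to make option 1 a bit more concrete, here is a rough sketch of the bottom-up bulk load. It's illustrative Java only, nothing JDBM-specific: the Page class, the long keys and the bulkLoad() method are just there to show the shape of the algorithm (it assumes at least one key and a page size >= 2).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Sketch: bottom-up bulk load of a btree from sorted data
 */
public class BulkLoadSketch
{
    /** A page simply holds up to pageSize keys plus pointers to its children */
    static class Page
    {
        List<Long> keys = new ArrayList<Long>();
        List<Page> children = new ArrayList<Page>();
    }

    public static Page bulkLoad( List<Long> keys, int pageSize )
    {
        // 1) order the data: the O(N x log(N)) part
        Collections.sort( keys );

        // 2) build the leaves: store pageSize elements per page
        List<Page> level = new ArrayList<Page>();

        for ( int i = 0; i < keys.size(); i += pageSize )
        {
            Page leaf = new Page();
            leaf.keys.addAll( keys.subList( i, Math.min( i + pageSize, keys.size() ) ) );
            level.add( leaf );
        }

        // 3) reproduce the same processing on the gathered largest keys,
        //    level by level, until a single root page remains
        while ( level.size() > 1 )
        {
            List<Page> parents = new ArrayList<Page>();

            for ( int i = 0; i < level.size(); i += pageSize )
            {
                Page parent = new Page();

                for ( Page child : level.subList( i, Math.min( i + pageSize, level.size() ) ) )
                {
                    // the separator key for a child is its largest key
                    parent.keys.add( child.keys.get( child.keys.size() - 1 ) );
                    parent.children.add( child );
                }

                parents.add( parent );
            }

            level = parents;
        }

        return level.get( 0 );
    }
}

The real work is obviously in writing those pages directly into the JDBM (or Mavibot) file format, but the tree construction itself is no more complicated than this.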
--
Regards, Cordialement,
Emmanuel Lécharny
www.iktek.com