Dear Steve,
We know that it is frustrating. If it were simple, it would have been fixed 3 years ago. Ok, that does not help :/ Let me give you a bit of background so you can understand the complexity of this issue.
An LDAP update request ends up with B-trees being updated, many B-trees, even for a single attribute being changed in an existing entry. For instance, when you add an entry, we have to update all of these indexes (B-trees):
- the master table, which contains the entry itself (but not its DN)
- the RDN index, which will be updated as many times as there are RDNs in the DN; so if your entry is 'cn=johnDoe,ou=people,dc=example,dc=com', with 'dc=example,dc=com' being your partition's suffix, we will update the index 3 times, as the suffix is considered one single RDN
- the presence index, which is used when you use filters like '(cn=*)'
- the ObjectClass index
- the entryCSN index (used for replication)
- the entryUUID index (each entry has a UUID that is unique across all the partitions)
and potentially more indexes, depending on the entry's content: alias, oneAlias, subAlias, administrativeRole. Add to that all the user-defined indexes.
That being said, you can understand that all of those indexes MUST be updated as a whole, or none of them should be written. We need transactions for that purpose, and we don't have transactions yet. Well, kind of.
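To make that concrete, here is a little Java sketch of what a single entry addition looks like from the index side. This is not the actual ApacheDS code; the Txn and Index types are invented for the example, and splitting the DN on commas is a simplification (the real code treats the suffix as one RDN). The point is simply that every insert below must succeed or fail together:

    import java.util.UUID;

    // Invented types standing in for the real ApacheDS internals.
    interface Txn { void commit() throws Exception; void abort(); }
    interface Index<K> { void insert(Txn txn, K key, UUID entryId) throws Exception; }

    class AddEntrySketch {
        Index<String> master, rdn, presence, objectClass, entryCsn, entryUuid;

        // All the index updates triggered by one entry addition, under one txn.
        void add(Txn txn, UUID id, String dn, String csn) throws Exception {
            try {
                master.insert(txn, dn, id);                // master table (the entry itself)
                for (String r : dn.split(",")) {           // one update per RDN (simplified)
                    rdn.insert(txn, r, id);
                }
                presence.insert(txn, "cn", id);            // presence index, for '(cn=*)'
                objectClass.insert(txn, "person", id);     // ObjectClass index
                entryCsn.insert(txn, csn, id);             // entryCSN index (replication)
                entryUuid.insert(txn, id.toString(), id);  // entryUUID index
                txn.commit();                              // the all-or-nothing point
            } catch (Exception e) {
                txn.abort();                               // none of the writes survive
                throw e;
            }
        }
    }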
JDBM, the main B-tree implementation we have been binding ApacheDS to since day one, has very poor support for transactions, and worse than that, it does not support cross-B-tree transactions. So we end up with a system that may break if we have a crash or concurrent updates.
We have spent months trying to fix this in ugly ways (adding locks all over the server, trying to leverage the JDBM transactions to some extent, adding a repair mode...), but in the end, we knew that we would need something different. Using BerkeleyDB JE wasn't an option (even though Kiran wrote a partition for it), for a simple reason: incompatible licenses...
So to put things in perspective, here is what we know we should do, and we knew it back in 2006 (yes, 11 years ago) for different reasons: we need an MVCC system that supports cross-B-tree transactions. That is what Mavibot was created for in 2012 (the bottom line being that we had tried to implement MVCC on top of JDBM, which was not a real success). Here is a reference to the mail where we discussed Mavibot: https://lists.apache.org/thread.html/72d9ba2fc6b567471464c385cdfcf80bf2882cd5e7825e3fa5bdc773@1340293586@%3Cdev.directory.apache.org%3E
For the record, we talked about MVCC in 2006 at Austin, where CouchDB was presented. Alex knew we were going to need transactions in the server, and I realized that MVCC was the way to go for LDAP (concurrent reads with no locks, one single writer).
As a matter of fact, OpenLDAP, and more specifically Howard Chu, started to work on LMDB, which would become the de facto backend for OpenLDAP back in 2011. That was funny, because we hadn't heard about it before the end of 2012... LMDB is exactly what Mavibot will be: MVCC with cross-B-tree transactions (except that LMDB is a bit more than that; it's based on a memory-mapped file and does not use a cache, something we do have to use in Mavibot, for the simple reason that serializing/deserializing objects in Java is way more complex than mapping a C struct on top of a byte[], and it's also way slower).
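If you have never played with MVCC, here is the idea in a toy Java sketch (nothing to do with the actual Mavibot or LMDB code): readers pick up an immutable snapshot without taking any lock, and the single writer publishes a whole new version atomically.

    import java.util.TreeMap;
    import java.util.concurrent.atomic.AtomicReference;

    // Toy MVCC store: one atomic "current root" pointer, immutable versions behind it.
    class MvccStore<K extends Comparable<K>, V> {
        private final AtomicReference<TreeMap<K, V>> root =
                new AtomicReference<>(new TreeMap<>());

        // Lock-free read: the snapshot a reader grabs can never change under it.
        V get(K key) {
            return root.get().get(key);
        }

        // Single writer: copy, modify the copy, publish it atomically.
        synchronized void put(K key, V value) {
            TreeMap<K, V> newVersion = new TreeMap<>(root.get());
            newVersion.put(key, value);
            root.set(newVersion);   // readers switch to the new version here
        }
    }

Of course, a real implementation does not copy the whole tree on each write; a B-tree only copies the root-to-leaf path it modifies. The superseded pages are exactly what the 'reclaimer' I mention below has to free.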
That being said, writing a new backend is not an easy thing. The very first version was released on 06/Jun/2013 (http://directory.apache.org/mavibot/download-old-versions.html), and it didn't have transaction support. It does work, though, and it is supported by ApacheDS. It even gives ApacheDS a real performance boost, with a 5x performance increase when it comes to updates.
So why didn't we make Mavibot the default database? Well, simply because it was not ready for production:
- the 'reclaimer' (see it as the database GC) was not functional yet. That makes the database grow very quickly, making it unusable in production (we are talking about GBs of data being accumulated for a few thousand entries)
- transactions weren't implemented properly
Bottom line, a new version of Mavibot is being designed, and as a matter of fact, I rebooted the effort 3 weeks ago; the current status is work in progress. Adding transactions is challenging, but the result is astonishing. Not because it makes the database safe against crashes (that comes by design with MVCC anyway), but because it makes the LDAP data safe against a crash, even one in the middle of an update. It also speeds up updates enormously, as we don't write anything to disk before the txn is completed, saving potentially 70% of the writes needed for an update. That has the potential of speeding up ApacheDS by a factor of 3 for updates (reads will not be impacted, well, sort of).
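To illustrate where those write savings come from, here is a hand-wavy Java sketch (the PageStore interface is invented for the example): pages dirtied many times inside one txn hit the disk exactly once, at commit.

    import java.util.LinkedHashMap;
    import java.util.Map;

    interface PageStore { void write(long pageId, byte[] page); }  // invented for the sketch

    class WriteBackTxn {
        private final PageStore store;
        private final Map<Long, byte[]> dirtyPages = new LinkedHashMap<>();

        WriteBackTxn(PageStore store) { this.store = store; }

        // Inside the txn, writes only touch the in-memory buffer; re-dirtying
        // the same page N times still costs a single disk write later.
        void writePage(long pageId, byte[] page) {
            dirtyPages.put(pageId, page);
        }

        // Nothing reaches the disk before this point.
        void commit() {
            dirtyPages.forEach(store::write);
            dirtyPages.clear();
        }

        // Aborting is free: just drop the buffer.
        void abort() {
            dirtyPages.clear();
        }
    }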
As I said, reads will not be impacted, but that's not exactly true. The current ApacheDS implementation uses locks that block reads while a write is being done (same story: trying to make the database safe against concurrent access). Having MVCC with transactions would mean we can remove all these stupid locks, speeding up reads, too (but to a lesser extent).
One last advantage: with Mavibot, we already have a bulk loader that can be used to create a database without having to load the entries one by one into ApacheDS. That means you will be able to inject 1M entries in a matter of seconds, instead of hours (or even days...).
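The trick behind a bulk loader is simple enough to sketch (again, an illustrative example, not the actual Mavibot API): instead of N inserts that each walk down the tree and rewrite pages, you sort the keys once and build the pages bottom-up, each one written exactly once.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class BulkLoadSketch {
        // Builds the leaf level of a B-tree from unsorted keys: one sort,
        // then every page is filled sequentially and emitted exactly once.
        static <K extends Comparable<K>> List<List<K>> buildLeaves(List<K> keys, int pageSize) {
            List<K> sorted = new ArrayList<>(keys);
            Collections.sort(sorted);                    // replaces N tree descents
            List<List<K>> leaves = new ArrayList<>();
            for (int i = 0; i < sorted.size(); i += pageSize) {
                leaves.add(new ArrayList<>(
                        sorted.subList(i, Math.min(i + pageSize, sorted.size()))));
            }
            return leaves;   // the parent levels are built the same way, level by level
        }
    }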
Ok, now, where are we? Not so far off, but not ready yet. Even if I complete the Mavibot transaction implementation next week, we will need a lot of checks to be sure it's safe. It might take weeks, or months.
But this is not the only problem: I do have a day job, and a family. I don't sleep a lot, but still, it's hard to work one hour on this project between midnight and 1am and be efficient. Currently, I'm mainly programming on this stuff during weekends.
Mavibot is not such a big piece of code (25 000 slocs), but it's a complex one. Integrating it in ApacheDS is not exactly a piece of cake, but it's already partially done. Also keep in mind that ApacheDS is quite a big baby: 200 000 slocs for the server, 200 000 slocs for the LDAP API it uses, 45 000 slocs for MINA, the underlying NIO framework, not to mention Studio's 200 000 slocs. We are facing a nearly 1 million slocs project... It takes a hell of a lot of time to work this out correctly in a way that does not break everything.
Last but not least, I'm doing this for the fun of it, and EVERYBODY is very welcome to join the effort, especially those depending on this piece of software. I mean, it's open source, and if it's broken, you can help fix it. Whining does not really help, but I can understand the exasperation one can feel when a critical bug is postponed version after version...
So keep the faith, and if you have time and energy, feel free to join the effort!