AVL vs Array, and DIRSERVER-1377

Emmanuel Lecharny Thu, 09 Jul 2009 16:24:01 -0700

Hi guys,

I've been working with Kiran for 3 weeks trying to fix DIRSERVER-1377. Ithink we have eliminated many differentt erros, and added some neededsynchronization code.

Right now, I can say that I was able to run the server using amulti-threaded client creating and deleting the same entry (on enetryper thread) without problem (I have done 2 000 000 updates, with 20threads).

But the oneLevel index is broken soon after the test has started, and Ican reproduce this behavior quite easily, after only a few thousands ofloops.

So i'm still digging, and now, my perception is that we have a problemin the current AVL tree implementation. I'm not very proud to admit thatI have spent more than one week trying to implement my own version ofthose tree, using a damn smart algorithm, all iterative, with a hell ofoptimisation and "pointers" all over the nodes. What a *fool* !

Not only it was useless, but the code I ended with, more than 4500 linesof code, is not less likely to be bug free than the existing code. NIHsyndrom, I guess.

Anyway, yesterday, I had a kind of illumination in the dark (and smellycode I'm gently floating in ...). Do we need to use AVLTrees ?


Let's see the pros and cons.

pros :

o From the academic POV, a very interesting data structure. A cool oneto evaluate your students, as soon as they don't ask you to write the codeo From the theorical POV : search in 1.44 O( log( N ) ), add and deletein O ( Log( N ) ) too : quite efficient

o From the ASF POV : it could be promoted to commons-collection

cons :
o A hell to implement correctly
o A hell to test thoroughly
o 4500 line sof code : a hell to maintain

o Costly memory consumption (4 references to other nodes : left andright, next and previou, a balance flag, a reference to the value)

o Complicated addition, even more complicated deletion
o Complicated serialization

o Almost impossible to enter in the code without a solid understandingof the algorithm, a few existing decent web references.


So should we continue to use them ? Is there an alternative ?


No. And *Yes*.

And a damn good alternative : a simple array of sorted elements.

Why do we immediatly think about complex data structure when simple oneare offering all what we need ?

Ok, what do we need ? A place to store a set of ordered values. Thiosevalues can be almost any kind of objects, and we may have millions ofthem. Let's consider the case we just store 512 objects max, as we willuse another data structure on disk when we are above thise number.

A Object[] is all what we need, assuming it's sorted. And sorting suchan array is easy, as we just have to find the place where we will insertthe new value, copy some part of the array to the right, and insert thedata. O(Log(N)) for the search, plus a copy of all the values above thegiven one.Removing an element : same thing : find it, move all the elements on theleft one slot, and we are done Searching : O(log(N)), as the array issorted (use a dichotomic alogorithm. Quite easy)

Memory ? We just store an array of references to Objects, plus someextra size. Let's say we start with a 16 slots array. Not to much spaceis wasted. If we compare that with an AvlNode, when having more than 3nodes, we save memory using an array even slightly too big.


Serialization is *way* easier, so is deserialization.

The whole code should be no more than a few hundred of commented andjavadoced lines of code.

Last, not least, it's likely to be *faster* than the AvlTree for thebasic search operation.

I'm almost done with this new implementation, just wanted to inform youabout those ideas. And I also think that it will fix DIRSERVER-1377.

Last, not least, I will be MIA for 5 days - unless I find some cheapinternet connection).


Thanks !

PS : Any thoughts ?

--
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org

AVL vs Array, and DIRSERVER-1377

Reply via email to