Very interesting improvements.

Love the increase in performance and the simplification of the class hierarchy.

+1 for the merge.

Regards,
Pierre-Arnaud


On 19 Sept 2012, at 16:34, Emmanuel Lécharny <[email protected]> wrote:

> Last September, I detected a problem when conducting concurrent 
> searches and modifications [1]. This problem is a real show-stopper.
> 
> The problem was that, as we were fetching the entries one by one using 
> indexes, if some modification occurred and impacted the index we were using while 
> the search was being processed, we might not be able to get back the values we 
> were pointing to in the search. This results in an NPE, at best.
> 
> We looked for ways to solve this issue, and a few options are available.
> 1) The first solution would be to implement an in-memory MVCC system, with a 
> write-ahead log. The idea is to keep in memory all the modified values until 
> all the pending searches are completed. They can then be removed from memory. 
> This solution would be cross-partition.
> 2) The second solution would be to use MVCC backends. As we don't remove any 
> element until it is no longer used, we have a guarantee that we will 
> always be able to fetch an element from the backend, even if it has been 
> modified. We will just get back an old revision (a small sketch of this idea 
> follows below).
> 3) A third solution, which does not require a global MVCC 
> system or an MVCC backend, would be to serialize writes and reads. Reads 
> would be concurrent, and writes could only be done when all the pending reads 
> and writes are completed. Of course, if a read is received while a write is 
> being processed, this read will have to wait for the write to complete. In 
> order to keep performance good, we need to speed up the reads so that writes 
> are not delayed forever: this is done by computing the set of candidates 
> immediately, before fetching the entries one by one later. If an entry has 
> been removed while we fetch it, then this entry will simply be ignored.
> 
> Those three solutions have pros and cons. We will analyze them in a later 
> paragraph.
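> 
> As a toy illustration of the 2nd idea - this is not Mavibot's actual API, just a 
> made-up versioned map - the key property is that a writer never destroys the 
> revision a reader has started on:
> 
>     // Toy MVCC store : readers pin a revision, writers publish a new one.
>     // Purely illustrative, not the Mavibot implementation.
>     import java.util.HashMap;
>     import java.util.Map;
> 
>     class MvccStore<K, V>
>     {
>         // The current revision, never modified in place once published.
>         // Readers keep using whatever revision they grabbed when they started.
>         private volatile Map<K, V> currentRevision = new HashMap<K, V>();
> 
>         /** A reader pins a revision for the whole duration of its search. */
>         public Map<K, V> beginRead()
>         {
>             return currentRevision;
>         }
> 
>         /** A writer copies the current revision, modifies the copy, and publishes it. */
>         public synchronized void put( K key, V value )
>         {
>             Map<K, V> newRevision = new HashMap<K, V>( currentRevision );
>             newRevision.put( key, value );
>             currentRevision = newRevision;
>         }
> 
>         public synchronized void remove( K key )
>         {
>             Map<K, V> newRevision = new HashMap<K, V>( currentRevision );
>             newRevision.remove( key );
>             currentRevision = newRevision;
>         }
>     }
> 
> A search that calls beginRead() before a delete keeps seeing the element in its 
> pinned revision, so it can never hit the NPE described above.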
> 
> I wanted to play around with the 2nd idea, and as I needed an in-memory MVCC BTree 
> for that, I wrote a prototype BTree for that purpose (Mavibot [2]).
> 
> When I thought Mavibot was ready (more or less), I created a new branch 
> (apacheds-mvbt) to experiment with the new backend. I had to do some cleanup in 
> order to be able to use the Mavibot implementation:
> 
> o Use UUIDs instead of Longs for Entry identifiers
> o Simplify the cursor implementation and the use of generics [3]
> 
> 
> After a few days, I found that some technical choices we had made were 
> problematic, and I went a bit further.
> 
> 1) The Cursor interface assumes you can move forward and backward. Sadly, 
> when you have a search using a filter with an 'OR' connector, there is no way 
> you can move backward. Not that it's really a problem in standard LDAP usage, 
> but it would defeat the VLV control.
> 
> 2) Not fetching entries immediately could potentially save time, if the entry 
> is not a valid candidate against the filter we use. However, using indexes to 
> check whether an entry is valid means we go down a BTree, which is also a 
> costly operation. If we have a filter with N indexed attributes, we will go 
> down N BTrees for each candidate, instead of fetching the entry once and 
> simply validating it against the filter using comparators (see the sketch 
> after this list).
> 
> 3) We were using generics extensively, in a way that makes it extremely 
> confusing to know what is what (having 3 type parameters for an IndexCursor does 
> not help at all)
> 
> 4) We were using a reverse table for every index, even when it was useless.
> 
> 5) The Index initialization was problematic
> 
> 6) Once we are using UUIDs instead of Longs, there is no reason to use the 
> entryUUID index anymore; we can do a full scan using the MasterTable directly
> 
> 7) The Cursor hierarchy was not fully consistent
> 
> 8) Entries were fetched more than once for complex filters, badly slowing down 
> the processing of such searches
> 
> 9) Some evaluators could be removed, sparing some CPU, when the number of 
> candidates they were selecting was 0
> 
> 10) Some Or and And cursors could be removed if they had only one 
> candidate
> 
> All those items have been fixed in the mvbt-branch.
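> 
> To make item 2 concrete, here is a rough sketch of the idea, with plain Java 
> types standing in for the real Entry and filter classes (the names below are 
> made up, not the ApacheDS API): once a candidate has been fetched from the 
> MasterTable, the remaining equality assertions are checked in memory instead 
> of descending one index BTree per attribute.
> 
>     import java.util.Map;
>     import java.util.Set;
> 
>     final class CandidateValidator
>     {
>         /**
>          * @param entry      the fetched entry, as attribute -> values
>          * @param assertions the equality assertions of the filter, as attribute -> asserted value
>          */
>         static boolean isValidCandidate( Map<String, Set<String>> entry, Map<String, String> assertions )
>         {
>             for ( Map.Entry<String, String> assertion : assertions.entrySet() )
>             {
>                 Set<String> values = entry.get( assertion.getKey() );
> 
>                 // The attribute is absent, or does not contain the asserted value : reject
>                 if ( ( values == null ) || !values.contains( assertion.getValue() ) )
>                 {
>                     return false;
>                 }
>             }
> 
>             return true;
>         }
>     }
> 
> With N indexed attributes in the filter, this replaces N BTree descents per 
> candidate with one MasterTable fetch plus N in-memory comparisons.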
> 
> Then, before implementing the Mavibot backend, I found that we could implement the 3rd 
> strategy easily. The idea is to allow concurrent searches, but writes would 
> be serialized.
> 
> The problem was that, with the way we were using cursors, we might browse a 
> BTree for a very long time - especially when processing a PagedSearch - and 
> we can't expect the writes to wait until such a search is completed. The 
> solution was to gather all the candidates *before* fetching any entry, and to 
> store their associated UUIDs into a list of candidates.
> 
> This is no different from how we were already conducting searches with an OR in 
> the filter: that also gathers all the candidate UUIDs up to the end of the 
> search.
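> 
> Here is a rough sketch of that strategy - the MasterTable interface and the 
> method names below are placeholders, not the real partition API:
> 
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.UUID;
> 
>     final class CandidateSearch
>     {
>         interface MasterTable
>         {
>             /** @return the entry for this id, or null if it has been deleted meanwhile */
>             Object fetch( UUID id );
>         }
> 
>         /** Phase 1 : collect the candidate UUIDs quickly, so pending writes only wait for this short phase. */
>         static List<UUID> collectCandidates( Iterable<UUID> indexScan )
>         {
>             List<UUID> candidates = new ArrayList<UUID>();
> 
>             for ( UUID id : indexScan )
>             {
>                 candidates.add( id );
>             }
> 
>             return candidates;
>         }
> 
>         /** Phase 2 : fetch the entries later, one by one. An entry deleted since phase 1 is simply skipped. */
>         static List<Object> fetchEntries( List<UUID> candidates, MasterTable master )
>         {
>             List<Object> entries = new ArrayList<Object>();
> 
>             for ( UUID id : candidates )
>             {
>                 Object entry = master.fetch( id );
> 
>                 if ( entry != null )
>                 {
>                     entries.add( entry );
>                 }
>             }
> 
>             return entries;
>         }
>     }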
> 
> Doing so offers many advantages:
> - we can exit the search faster, with all the candidates already selected 
> (even for a paged search), and the writes can be applied more frequently
> - we speed up searches where more than one filter can be used to select the 
> valid candidates
> - now that the set has been created, we can move forward *and* backward, 
> something we can't do with the current implementation when we use an OR/AND 
> connector in the filter.
> 
> There are some cons though:
> - gathering candidate UUIDs may eat memory
> - in some cases, it may be slower than the current implementation, just 
> because the number of selected candidates may be way higher than the number of 
> entries we will return (we pay the cost of extra entry fetches in this case)
> 
> But even then, we already have to keep a set of candidate UUIDs when doing a 
> search with an OR/AND connector in the current implementation, and it's not 
> proven that fetching an entry is more costly than fetching a <key/value> of a 
> given index - not to mention that the math may change a lot if we have more 
> than one node in the filter.
> 
> Now, in order to fix the concurrency issue we found last year, we have to 
> add a couple of ReadWriteLocks:
> o one in the OperationManager, to forbid writes from being executed while some 
> searches are holding the ReadLock, and to forbid a search from being executed while 
> a write holds the WriteLock
> o another one in the AbstractBtreePartition, to protect the master table from 
> concurrent access, as the searches will directly fetch the entries from the 
> MasterTable - and the RdnIndex - even while some writes are updating them.
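> 
> A minimal sketch of that locking scheme, using a plain java.util.concurrent 
> ReentrantReadWriteLock (the class and method names are placeholders, not the 
> actual OperationManager code):
> 
>     import java.util.concurrent.locks.ReadWriteLock;
>     import java.util.concurrent.locks.ReentrantReadWriteLock;
> 
>     final class OperationLocking
>     {
>         private final ReadWriteLock rwLock = new ReentrantReadWriteLock();
> 
>         /** Searches run concurrently : they only take the read lock. */
>         public void search( Runnable searchOperation )
>         {
>             rwLock.readLock().lock();
> 
>             try
>             {
>                 searchOperation.run();
>             }
>             finally
>             {
>                 rwLock.readLock().unlock();
>             }
>         }
> 
>         /** Writes are serialized : they wait for all pending reads and writes to complete. */
>         public void write( Runnable writeOperation )
>         {
>             rwLock.writeLock().lock();
> 
>             try
>             {
>                 writeOperation.run();
>             }
>             finally
>             {
>                 rwLock.writeLock().unlock();
>             }
>         }
>     }
> 
> The second lock, in AbstractBtreePartition, follows the same pattern: fetching 
> from the MasterTable (and the RdnIndex) takes the read side, updates take the 
> write side.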
> 
> The current branch implements this algorithm. It works, and it's fast. I have 
> tested the concurrent modifications and searches that demonstrated the issue 
> last year: I was able to run 10 000 loops over 20 threads, each one doing:
> - add cn=test<N>
> - search subtree cn=test*
> - delete cn=test<N>
> where N is the thread number.
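> 
> For reference, here is a rough, self-contained approximation of that test over 
> the wire, using plain JNDI - the host, port, credentials and the ou=system base 
> are assumptions; the real test in the ApacheDS test suite uses the project's 
> own client API:
> 
>     import java.util.Hashtable;
> 
>     import javax.naming.Context;
>     import javax.naming.NamingEnumeration;
>     import javax.naming.directory.BasicAttribute;
>     import javax.naming.directory.BasicAttributes;
>     import javax.naming.directory.DirContext;
>     import javax.naming.directory.InitialDirContext;
>     import javax.naming.directory.SearchControls;
>     import javax.naming.directory.SearchResult;
> 
>     public class ConcurrentSearchModifyTest
>     {
>         public static void main( String[] args ) throws Exception
>         {
>             Thread[] threads = new Thread[20];
> 
>             for ( int t = 0; t < threads.length; t++ )
>             {
>                 final int threadNumber = t;
> 
>                 threads[t] = new Thread( new Runnable()
>                 {
>                     public void run()
>                     {
>                         try
>                         {
>                             // One connection per thread
>                             Hashtable<String, String> env = new Hashtable<String, String>();
>                             env.put( Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory" );
>                             env.put( Context.PROVIDER_URL, "ldap://localhost:10389/ou=system" );
>                             env.put( Context.SECURITY_AUTHENTICATION, "simple" );
>                             env.put( Context.SECURITY_PRINCIPAL, "uid=admin,ou=system" );
>                             env.put( Context.SECURITY_CREDENTIALS, "secret" );
> 
>                             DirContext ctx = new InitialDirContext( env );
>                             String rdn = "cn=test" + threadNumber;
> 
>                             for ( int i = 0; i < 10000; i++ )
>                             {
>                                 // add cn=test<N>
>                                 BasicAttributes attrs = new BasicAttributes( true );
>                                 BasicAttribute objectClass = new BasicAttribute( "objectClass" );
>                                 objectClass.add( "top" );
>                                 objectClass.add( "person" );
>                                 attrs.put( objectClass );
>                                 attrs.put( "cn", "test" + threadNumber );
>                                 attrs.put( "sn", "test" );
>                                 ctx.createSubcontext( rdn, attrs );
> 
>                                 // search subtree cn=test*
>                                 SearchControls controls = new SearchControls();
>                                 controls.setSearchScope( SearchControls.SUBTREE_SCOPE );
>                                 NamingEnumeration<SearchResult> results = ctx.search( "", "(cn=test*)", controls );
> 
>                                 while ( results.hasMore() )
>                                 {
>                                     results.next();
>                                 }
> 
>                                 // delete cn=test<N>
>                                 ctx.destroySubcontext( rdn );
>                             }
> 
>                             ctx.close();
>                         }
>                         catch ( Exception e )
>                         {
>                             e.printStackTrace();
>                         }
>                     }
>                 } );
> 
>                 threads[t].start();
>             }
> 
>             for ( Thread thread : threads )
>             {
>                 thread.join();
>             }
>         }
>     }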
> 
> This test was failing badly on trunk (the more threads we have, the more 
> errors we get), and is now running just fine.
> 
> (The test was failing because, while doing the search, we were fetching 
> entries from the backend while some deletions were being done at the same 
> time.)
> 
> One more thing: performance. The branch is actually faster than the current 
> trunk. The searches were conducted using the SearchPerfIT test.
> 
> Here are the performances I get when using the embedded LDAP server :
> 
> 1.5.0 (FTR) :
> - Object   Search ->  32 993/s
> - OneLevel Search -> 156 935/s
> - SubLevel Search -> 137 225/s
> 
> Trunk :
> - Object   Search ->  57 191/s
> - OneLevel Search ->  76 250/s
> - SubLevel Search ->  77 200/s
> 
> MVBT branch :
> - Object   Search ->  76 540/s (+33%/trunk)
> - OneLevel Search -> 109 985/s (+44%/trunk)
> - SubLevel Search -> 134 750/s (+74%/trunk)
> 
> We get back one entry for the Object search, 5 for the OneLevel search and 10 
> for the SubLevel search. What we measure is the number of entries we get back 
> per second.
> 
> The 1.5.0 version is actually faster for OneLevel and SubLevel searches 
> because we were storing the DN in the entry (rebuilding the DN is a 
> costly operation).
> 
> The gain is even higher when we use a complex filter, due to the numerous and 
> spurious fetches of the same entry in trunk.
> 
> The same tests but using the network :
> MVBT branch :
> Object :  2 564/s
> One    : 17 960/s
> Sub    : 23 050/s
> 
> Trunk :
> Object :  2 567/s
> One    : 12 555/s
> Sub    : 21 240/s
> 
> Here, the gain is less significant, mainly because we have the client and the 
> server on the same machine, and the message encoding/decoding overshadows the 
> gain in the backend. But still, we see some gap in the OneLevel and 
> SubLevel searches (between 8% and 43%).
> 
> At this point, I think we have a working server, solving a critical issue we 
> have been fighting for more than a year now, and it's actually faster than 
> the current trunk. It also solves some other issues.
> 
> Before continuing with the Mavibot backend implementation - which will bring 
> some major performance boosts, AFAICT - I would suggest we promote the 
> current mvbt branch as the trunk, and cut a 2.0.0-M8 release asap. I do 
> think that it's a better, more stable, and faster working base for the 
> upcoming versions and, more importantly, for the 2.0 GA.
> 
> We should also continue the ongoing work on the other solutions, as they offer 
> some extra advantages, but we may have them in a 2.1 or a later version.
> 
> so, wdyt ?
> 
> -- 
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
> 
