Oops, sorry, I just saw this thread and responded to the other one instead. I'll switch to this one.
On Mon, Aug 29, 2011 at 12:55 PM, Emmanuel Lecharny <[email protected]> wrote:

> Hi Selcuk,
>
> On 8/29/11 10:52 AM, Selcuk AYA wrote:
>> Hi, I just attached my latest changes for the jdbm branch and wanted
>> to give a status update and some technical details:
>
> Many thanks!
>
>> 1) Summary
>> * We now have a jdbm tree which treats find, insert, remove and browse
>> as actions that execute in isolation. In particular, read actions will
>> not be affected by ongoing structural changes to the tree and will
>> only see data changes that completed before they started.
>
> Just great!
>
>> * We allow one writer and multiple readers to execute concurrently.
>> Synchronized operations are mostly removed.
>
> Woot! We need to do some perf tests to see what kind of performance
> improvement it brings...
>
>> * Existing tests, except StoredProcedureIT and the unit tests I added
>> for the versioned tree, pass (I ran mvn clean install -Dintegration).
>> I think the problem with StoredProcedureIT is an existing one. There
>> is a piece of code where I serialize and deserialize tuple values
>> stored in the JDBM btree in order to do a deep copy. With
>> StoredProcedureIT, deserializing the "hello world" stored procedure
>> throws a UTFDataFormatException. On a clean branch, I added similar
>> code to deserialize B+ tree page values right after they are
>> serialized, and I hit the same issue. So I think this is an existing
>> issue with stored procedure serialization/deserialization.
>
> I'm gonna review the ignored test.
>
>> 2) Changes above the JDBM level
>> * I added changes to call the newly added browser->close() interface
>> when cursors are closed or a cursor is repositioned.
>> * I hit some existing issues where cursors are not closed. In
>> particular, I had to change SubentryInterceptor.java to close the
>> cursor after search operations, and change the JDBM container cursor
>> to close the contained cursor when it is closed.
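For illustration, the cursor-close fix described above boils down to the usual close-in-finally pattern. A minimal sketch with a hypothetical Cursor class (not the actual ApacheDS cursor interface):

```java
import java.io.Closeable;

public class CursorCloseSketch {
    // Hypothetical minimal cursor; the real ApacheDS interface differs.
    static class Cursor implements Closeable {
        boolean closed = false;
        boolean next() { return false; } // no entries in this sketch
        @Override public void close() { closed = true; }
    }

    // Always release the cursor, even if processing throws: this is the
    // shape of the fix applied after search operations.
    static boolean searchAndClose(Cursor cursor) {
        try {
            while (cursor.next()) {
                // process entry...
            }
        } finally {
            cursor.close(); // guarantees release even on exception
        }
        return cursor.closed;
    }

    public static void main(String[] args) {
        System.out.println(searchAndClose(new Cursor())); // prints "true"
    }
}
```

A wrapping cursor would apply the same pattern in its own close(), closing the contained cursor.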
>> If required, I can provide these changes as separate fixes.
>
> Yeah, just create a new JIRA for this one, and attach new patches.
>
>> 3) Technical details at the JDBM level:
>>
>> * The core functionality is in LRUCache.java. This implements a
>> concurrent, versioned cache. There are a power-of-two number of hash
>> buckets, with a lock shared by each group of 8 buckets (lock
>> striping). The number of hash buckets is the largest power of two x
>> such that x < the maximum number of cache entries.
>
> Ok.
>
>> * The cache replacement policy is LRU. There are 16 LRU lists, and
>> each cache entry is assigned to one of them. Each LRU list is
>> protected by a separate lock. LRU replacement is supposed to be fast.
>> Threads choose an LRU list based on a randomizer. Since replacement is
>> supposed to be fast and each thread randomly chooses an LRU list to
>> replace from, LRU operations should not be a bottleneck.
>
> Do we need to make this number (16) configurable?
>
>> * Each cache entry has a [startVersion, endVersion) interval in which
>> it is valid. At any time, a hash bucket chain looks like this:
>>
>> (key1, startVersion11, endVersion11) <-> (key2, startVersion21, endVersion21) <-> ...
>>                 |                                        |
>> (key1, startVersion12, endVersion12)     (key2, startVersion22, endVersion22)
>>                 |                                        |
>>                ...                                      ...
>>
>> That is, there is a primary chain where entries for different keys are
>> linked, and then a subchain where the different versions of a given
>> key are held. So when readers search for a (key, version) pair, they
>> first walk the primary chain and then walk the subchain to find their
>> entry.
>
> The schema is not rendered correctly in the mail, but it's not a big
> deal. I'm not sure I grok your explanation though. I need to re-read it
> later...
>
>> * As writes create previous versions of entries,
>
> You mean 'new versions', right? Or maybe what you mean is that once you
> modify an existing entry, the previous version of this entry is stored
> into the cache?
>> they use part of the cache to store them. The rule is that such an
>> entry cannot be replaced as long as there might be a reader which
>> might read it. We keep track of the minimum read action version to
>> decide when such entries become replaceable.
>>
>> * As writes create previous versions of entries, they use part of the
>> cache to store them. If there are long browse operations and quite a
>> bit of updates going on at the same time, we might run into a case
>> where most of the cache entries are used to store previous versions.
>
> Ok.
>
>> We might even have a case where all entries store previous versions
>> and cannot be replaced (because of the rule above). In this case,
>> writers wait for a freeable cache entry.
>
> Ok. We could probably choose to keep the old versions on disk, to avoid
> having to hang a write, but for a first version, I think it's an
> acceptable solution.
>
>> When a reader cannot find a replaceable entry, it reads from disk
>> while holding the bucket latch (and thus blocking any writer on the
>> same location), and returns the entry to the user without populating
>> the cache, and thus without looking for a replaceable cache entry.
>> Since readers always make progress, the minimum read version will
>> eventually advance and writers will progress too. Normally, when
>> readers or writers do IO, they release the hash latch.
>
> Ok.
>
>> * There are some helper classes for the LRUCache to work. Maybe the
>> most interesting one is ActionVersioning, which uses AtomicInteger and
>> AtomicReference and is optimized for the read-mostly case. We also
>> have ExplicitList, where remove operations are O(1) given an element
>> (in contrast to the O(n) remove, given a reference to an element, on
>> Java's lists). Such fast removes are needed for the LRU algorithm.
>
> Ok.
>
>> * When (key, value) pairs are added to the Btree or when they are
>> retrieved, the Btree does a deep copy of the value (through
>> serialization/deserialization).
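For illustration, the deep copy through serialization can be sketched as a generic round-trip helper (assumed code, not the actual JDBM implementation; error handling is simplified for the sketch):

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class DeepCopySketch {
    // Deep-copies any Serializable value by round-tripping it through
    // Java serialization, so a cached copy cannot be mutated by callers.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException("deep copy failed", e); // sketch-level handling
        }
    }

    public static void main(String[] args) {
        ArrayList<String> original = new ArrayList<>(List.of("a", "b"));
        ArrayList<String> copy = deepCopy(original);
        copy.set(0, "x");
        System.out.println(original.get(0)); // prints "a": the copies are isolated
    }
}
```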
>> This is needed so that the Btree can store previous versions of
>> values. I assumed keys stored in Btrees are not changed. If they are,
>> even the CacheRecordManager currently in use wouldn't work.
>
> Do you mean we never delete anything from the BTree?
>
>> 4) Possible improvements:
>> * If most of the cache entries are used to store previous versions,
>> cache effectiveness will decrease. A solution is to start spilling
>> previous versions to disk when this happens; the subchain we talked
>> about above would have to be spilled to disk. However, this is only a
>> performance problem, and a corner-case one at that if it is true that
>> LDAP is read-mostly.
>
> Yes. See one of my comments above.
>
>> * Currently, when a write action is executing and there is an IO
>> exception, the action is aborted and I do not advance the read
>> version, so readers do not see the effects of the aborted action.
>> However, it seems that the upper layers do not do enough cleanup in
>> this case; they continue using the jdbm stores, and this will lead to
>> inconsistency. A good thing would be to roll back all the dirty
>> changes. Also, jdbm transactions are not currently enabled, so a crash
>> in the middle of syncing might leave the store inconsistent.
>
> I guess we have to address those issues. Crashes can occur when we get
> a full file system, or a defective disk. In both cases, I think having
> a crash recovery system should be enough, as a first approach. In any
> case, I guess that the server has to be restarted when it occurs, so
> that some corrective actions can be done before the backend serves any
> more requests...
>
>> 5) TODO:
>> * Add some more test cases for the versioned btree to cover corner
>> cases.
>> * I am not very willing to implement disk spilling, since this is only
>> a performance improvement needed in corner cases if stores are mostly
>> read-only. But if you guys think this is really necessary, I can look
>> into it as well.
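As a side note, the O(1) remove of the ExplicitList helper mentioned in section 3 can be sketched as an intrusive doubly-linked list, where callers hold the node itself (hypothetical code, not the actual class):

```java
public class ExplicitListSketch {
    // Callers keep the Node, so no traversal is needed to remove it.
    static class Node<T> {
        final T value;
        Node<T> prev, next;
        Node(T value) { this.value = value; }
    }

    static class IntrusiveList<T> {
        private final Node<T> head = new Node<>(null); // sentinel
        private int size = 0;

        IntrusiveList() { head.prev = head; head.next = head; }

        void addFirst(Node<T> n) {
            n.next = head.next;
            n.prev = head;
            head.next.prev = n;
            head.next = n;
            size++;
        }

        // O(1): unlink the node directly, unlike List.remove(Object),
        // which scans the list to find the element first.
        void remove(Node<T> n) {
            n.prev.next = n.next;
            n.next.prev = n.prev;
            n.prev = n.next = null;
            size--;
        }

        int size() { return size; }
    }

    public static void main(String[] args) {
        IntrusiveList<String> lru = new IntrusiveList<>();
        Node<String> a = new Node<>("a"), b = new Node<>("b");
        lru.addFirst(a);
        lru.addFirst(b);
        lru.remove(a); // constant time, no scan
        System.out.println(lru.size()); // prints "1"
    }
}
```

This is why constant-time removal matters for the LRU lists: moving a touched entry to the head of its list must not require an O(n) scan.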
> Right now, I consider that it's just an improvement. What is important
> atm is to have a working server. That means we need more tests, and
> some wide ones (i.e., with many clients injecting read and write
> requests, run over some long period of time).
>
> Selcuk, this is excellent work! I'm going to apply your last changes to
> the repo. Many thanks!
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com

--
Best Regards,
-- Alex
