Akash: Take a look at the following two possible replacements for CSLM:
https://github.com/mspiegel/lockfreeskiptree
https://github.com/nbronson/snaptree
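Either candidate would have to slot in behind the same interface the MemStore already uses: KeyValueSkipListSet is backed by a ConcurrentNavigableMap. A minimal sketch of such a seam is below, assuming the replacement libraries can be adapted to that interface; the MapFactory interface and factory class names are illustrative only, not part of HBase or of those projects.

import java.util.Comparator;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical seam for swapping the map implementation under test.
interface MapFactory {
    <K, V> ConcurrentNavigableMap<K, V> create(Comparator<? super K> cmp);
}

// Current MemStore behaviour: a plain ConcurrentSkipListMap.
class CslmFactory implements MapFactory {
    @Override
    public <K, V> ConcurrentNavigableMap<K, V> create(Comparator<? super K> cmp) {
        return new ConcurrentSkipListMap<K, V>(cmp);
    }
}

// A SnapTreeFactory / LockFreeSkipTreeFactory would go here, assuming those
// libraries expose ConcurrentNavigableMap-compatible maps; if they only
// implement NavigableMap, a thin adapter would be needed as well.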
About your test, I got advice from another expert:
- the JVM warm-up penalty is being accrued by the CSLM run
- Use Thread.yield() to end a request, as otherwise the active thread may be able to run through its time slice without incurring much lock contention.

You can publish your code and results under HBASE-3992.

Thanks

On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:
> Oh, and when running these experiments, you should look at the impact of the order they are run in, whether you run them multiple times per JVM instance, etc. Basically, you need to be cognizant of the HotSpot optimizations the JVM is doing at runtime.
>
> > -----Original Message-----
> > From: Jonathan Gray [mailto:jg...@fb.com]
> > Sent: Sunday, October 23, 2011 4:20 PM
> > To: dev@hbase.apache.org
> > Subject: RE: SILT - nice keyvalue store paper
> >
> > Very nice experiment, Akash. Keep getting your hands dirty and digging! :)
> >
> > I think your results might change if you bump the test up to 1000 threads or so. 100 threads can still perform okay when there's a global lock, but the contention at 1000 threads will kill you, and that's when CSLM should do much better. (1000 handler threads is approx. what I run with on RS in prod.) Though I am a bit surprised that at 100 threads the TreeMap was significantly faster. Your inconsistent results are a bit odd; you might try an order of magnitude more operations per thread. You might also gather some statistics about tree size and per-operation latency.
> >
> > I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest. I recall trying some impractically large MemStores and everything still being quite fast.
> >
> > Over in Cassandra, I believe they have a two-level CSLM with the first map key being the row and then the columns for each row in their own CSLM. I've been told this is somewhat of a pain point for them. And keep in mind they have one shard/region per node and we generally have several smaller MemStores on each node (tens to thousands). Not sure we would want to try that. There could be some interesting optimizations if you had very specific issues; for example, if you had a ton of reads to MemStore and not many writes, you could keep some kind of mirrored hashmap.
> >
> > And for writes, the WAL is definitely the latency bottleneck. But if you are doing lots of small operations, so your WALEdits are not large, and with some of the HLog batching features going into trunk, you end up with hundreds of requests per HLog sync. And although the syncs are higher latency, with batching you end up getting high throughput. And the bottleneck shifts.
> >
> > Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch, 4ms per sync, so 62.5k req/sec. (62.5k req/sec * 100 bytes/req is roughly 6MB/sec, very reasonable.) If you're mixing in reads as well (or if you're doing increments, which do a read and a write), then this adds to the CPU usage and contention without adding to HLog throughput.
> >
> > All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput. Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU bottlenecked.
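A back-of-envelope version of the arithmetic above, using the illustrative inputs from the thread (250 requests per sync batch, ~4ms per sync, ~100 bytes per request); these are assumptions, not measurements.

// Back-of-envelope check of the HLog batching numbers discussed above.
public class HLogThroughputEstimate {
    public static void main(String[] args) {
        int requestsPerSyncBatch = 250;      // requests grouped into one HLog sync
        double syncLatencySeconds = 0.004;   // ~4 ms per sync
        int bytesPerRequest = 100;           // small WALEdits

        double syncsPerSecond = 1.0 / syncLatencySeconds;                  // 250 syncs/sec
        double requestsPerSecond = syncsPerSecond * requestsPerSyncBatch;  // 62,500 req/sec
        double bytesPerSecond = requestsPerSecond * bytesPerRequest;       // ~6 MB/sec

        System.out.printf("%.0f req/sec, ~%.1f MB/sec%n",
                requestsPerSecond, bytesPerSecond / (1024 * 1024));
    }
}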
> >
> > For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM. Doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :) Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they haven't all made their way to Apache.
> >
> > In the end, after our optimizations, the real-world workload Dhruba and I were working with was not all in-memory, so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.
> >
> > JG
> >
> > From: Akash Ashok [mailto:thehellma...@gmail.com]
> > Sent: Sunday, October 23, 2011 2:57 AM
> > To: dev@hbase.apache.org
> > Subject: Re: SILT - nice keyvalue store paper
> >
> > I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the memstore uses) and a synchronized TreeMap (which was literally a TreeMap, synchronized). I executed concurrent reads, writes and deletes on both of them. Surprisingly, the synchronized TreeMap performed better, though just slightly better, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
> >
> > Here are the outputs of a few runs.
> >
> > Sometimes the difference was considerable:
> > Using HBaseMap it took 20438ms
> > Using TreeMap it took 11613ms
> > Time Difference: 8825ms
> >
> > And sometimes the difference was negligible:
> > Using HBaseMap it took 13370ms
> > Using TreeMap it took 9482ms
> > Time Difference: 3888ms
> >
> > I'm attaching the test java file which I wrote to test it. This might be a very minor difference, but it is still surprising considering the fact that ConcurrentSkipListMap uses fancy two-level indexes which they say improve deletion performance.
> >
> > And here are the details about the test run:
> > 100 threads each fetching 1,000,000 records
> > 100 threads each adding 1,000,000 records
> > 100 threads each deleting 1,000,000 records
> > (Reads, writes and deletes run simultaneously.)
> >
> > Cheers,
> > Akash A
> >
> > On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
> > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
> > > I would think that the bottleneck for insert is the WAL part?
> > > It would be possible to do a kind of memory list preparation during the WAL insertion, and if the WAL insertion is confirmed, do the insertion in the memory list. But it's strange to have the insertion in memory matter so much vs. the insertion on disk...
> > >
> > Yes, WAL is the long pole when writing. But MemStore has issues too; Dhruba says CPU above. Reading and writing it is also 'slow'.
> > St.Ack
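Akash's test file is attached to the thread rather than inlined; the sketch below shows the general shape of such a harness with the two suggestions from above applied: an untimed warm-up pass so HotSpot compilation is not charged to whichever map runs first, and Thread.yield() after each operation so a thread does not burn its whole time slice without contending. Thread counts, key types and operation counts are placeholders, not the settings from the attached test.

import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of a CSLM vs. synchronized TreeMap contention test: concurrent
// puts, gets and removes against the same map, timed after a warm-up run.
public class MapContentionBench {

    static final int THREADS = 100;          // per operation type
    static final int OPS_PER_THREAD = 100_000;

    static long runMillis(final Map<Long, Long> map) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS * 3);
        final CountDownLatch done = new CountDownLatch(THREADS * 3);
        long start = System.nanoTime();
        for (int t = 0; t < THREADS; t++) {
            final long base = (long) t * OPS_PER_THREAD;
            pool.execute(() -> {                       // writers
                for (long i = 0; i < OPS_PER_THREAD; i++) { map.put(base + i, i); Thread.yield(); }
                done.countDown();
            });
            pool.execute(() -> {                       // readers
                for (long i = 0; i < OPS_PER_THREAD; i++) { map.get(base + i); Thread.yield(); }
                done.countDown();
            });
            pool.execute(() -> {                       // deleters
                for (long i = 0; i < OPS_PER_THREAD; i++) { map.remove(base + i); Thread.yield(); }
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // Warm-up: run both maps once and discard the timings.
        runMillis(new ConcurrentSkipListMap<Long, Long>());
        runMillis(Collections.synchronizedSortedMap(new TreeMap<Long, Long>()));

        long cslm = runMillis(new ConcurrentSkipListMap<Long, Long>());
        long sync = runMillis(Collections.synchronizedSortedMap(new TreeMap<Long, Long>()));
        System.out.println("ConcurrentSkipListMap: " + cslm + " ms");
        System.out.println("synchronized TreeMap:  " + sync + " ms");
    }
}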