On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Akash:
> Take a look at the following two possible replacements for CSLM:
> https://github.com/mspiegel/lockfreeskiptree
> https://github.com/nbronson/snaptree

Thanks Ted :) :) :). I was pretty much on the lookout for other structures. I tried http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html but it didn't perform that well. Will look into these replacements.

> About your test, I got advice from another expert:
>
> - the JVM warm-up penalty is being accrued by the CSLM run
> - use Thread.yield() to end a request, as otherwise the active thread may be able to run through its time slice without incurring much lock contention

Hmm, I was just wondering: since each thread runs 10,000+ inserts and deletes, it has to context switch quite a few times during its lifecycle, right? Wouldn't it be enough if I increased the number of iterations per thread by a factor of, say, 100?

> You can publish your code and results under HBASE-3992.
>
> Thanks
>
> On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:
>
> > Oh, and when running these experiments, you should look at the impact of the order they are run in, whether you run them multiple times per JVM instance, etc. Basically, you need to be cognizant of the HotSpot optimizations the JVM is doing at runtime.
> >
> > > -----Original Message-----
> > > From: Jonathan Gray [mailto:jg...@fb.com]
> > > Sent: Sunday, October 23, 2011 4:20 PM
> > > To: dev@hbase.apache.org
> > > Subject: RE: SILT - nice keyvalue store paper
> > >
> > > Very nice experiment, Akash. Keep getting your hands dirty and digging! :)
> > >
> > > I think your results might change if you bump the test up to 1000 threads or so. 100 threads can still perform okay when there's a global lock, but the contention at 1000 threads will kill you, and that's when CSLM should do much better. (1000 handler threads is approx. what I run with on RS in prod.) Though I am a bit surprised that at 100 threads the TreeMap was significantly faster. Your inconsistent results are a bit odd; you might try an order of magnitude more operations per thread. You might also gather some statistics about tree size and per-operation latency.
> > >
> > > I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest. I recall trying some impractically large MemStores and everything still being quite fast.
> > >
> > > Over in Cassandra, I believe they have a two-level CSLM with the first map key being the row and then the columns for each row in their own CSLM. I've been told this is somewhat of a pain point for them. And keep in mind they have one shard/region per node and we generally have several smaller MemStores on each node (tens to thousands). Not sure we would want to try that. There could be some interesting optimizations if you had very specific issues; for example, if you had a ton of reads to MemStore and not many writes, you could keep some kind of mirrored hashmap.
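For concreteness, here is a minimal sketch of the two-level layout described above: an outer ConcurrentSkipListMap keyed by row, with each row holding its own inner CSLM of columns. String keys are used only to keep the example self-contained; this is illustrative, not Cassandra or HBase code, and the class and method names are made up.

import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class TwoLevelMapSketch {

    // Outer CSLM keyed by row; each value is an inner CSLM of columns for that row.
    private final ConcurrentNavigableMap<String, ConcurrentNavigableMap<String, byte[]>> rows =
            new ConcurrentSkipListMap<String, ConcurrentNavigableMap<String, byte[]>>();

    public void put(String row, String column, byte[] value) {
        ConcurrentNavigableMap<String, byte[]> columns = rows.get(row);
        if (columns == null) {
            // putIfAbsent keeps concurrent row creation race-free without a lock.
            ConcurrentNavigableMap<String, byte[]> fresh =
                    new ConcurrentSkipListMap<String, byte[]>();
            ConcurrentNavigableMap<String, byte[]> prev = rows.putIfAbsent(row, fresh);
            columns = (prev == null) ? fresh : prev;
        }
        columns.put(column, value);
    }

    public byte[] get(String row, String column) {
        ConcurrentNavigableMap<String, byte[]> columns = rows.get(row);
        return (columns == null) ? null : columns.get(column);
    }
}

A point read touches the outer map once and then the (usually much smaller) per-row map; the cost is maintaining many small inner maps, which may be part of the pain point mentioned above.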
> > > And for writes, the WAL is definitely the latency bottleneck. But if you are doing lots of small operations, so your WALEdits are not large, and with some of the HLog batching features going into trunk, you end up with hundreds of requests per HLog sync. And although the syncs are higher latency, with batching you end up getting high throughput. And the bottleneck shifts.
> > >
> > > Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch, 4ms per sync, so 62.5k req/sec (62.5k * 100 bytes/req = 6.25MB/sec, very reasonable). If you're mixing in reads as well (or if you're doing increments, which do a read and a write), then this adds to the CPU usage and contention without adding to HLog throughput.
> > >
> > > All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput. Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU bottlenecked.
> > >
> > > For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM. Doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :) Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they haven't all made their way to Apache.
> > >
> > > In the end, after our optimizations, the real-world workload Dhruba and I were working with was not all in-memory, so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.
> > >
> > > JG
> > >
> > > From: Akash Ashok [mailto:thehellma...@gmail.com]
> > > Sent: Sunday, October 23, 2011 2:57 AM
> > > To: dev@hbase.apache.org
> > > Subject: Re: SILT - nice keyvalue store paper
> > >
> > > I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the MemStore uses) and a synchronized TreeMap (which was literally a TreeMap, synchronized). I executed concurrent reads, writes and deletes on both of them. Surprisingly, the synchronized TreeMap performed better, though only slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
> > >
> > > Here is the output of a few runs.
> > >
> > > Sometimes the difference was considerable:
> > > Using HBaseMap it took 20438ms
> > > Using TreeMap it took 11613ms
> > > Time difference: 8825ms
> > >
> > > And sometimes the difference was negligible:
> > > Using HBaseMap it took 13370ms
> > > Using TreeMap it took 9482ms
> > > Time difference: 3888ms
> > >
> > > I've attached the test java file which I wrote for this. The difference may be minor, but it is still surprising considering that ConcurrentSkipListMap uses fancy two-level indexes which are said to improve deletion performance.
> > >
> > > And here are the details of the test run:
> > > 100 threads each fetching 1,000,000 records
> > > 100 threads each adding 1,000,000 records
> > > 100 threads each deleting 1,000,000 records
> > > (reads, writes and deletes running simultaneously)
> > >
> > > Cheers,
> > > Akash A
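The attached test file isn't reproduced in the archive. A rough sketch of this kind of comparison (not the attached code; the thread count, key space, Long keys and operation counts are arbitrary choices) might look like the following.

import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MapContentionBench {

    static final int THREADS = 100;           // threads per operation type
    static final int OPS_PER_THREAD = 100000; // the run described above used 1,000,000

    // Runs readers, writers and deleters concurrently against the given map
    // and returns the elapsed wall-clock time in milliseconds.
    static long run(final SortedMap<Long, Long> map) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS * 3);
        final CountDownLatch start = new CountDownLatch(1);
        for (int t = 0; t < THREADS; t++) {
            final long base = (long) t * OPS_PER_THREAD;
            pool.execute(new Runnable() { public void run() {      // writer
                awaitQuietly(start);
                for (long i = 0; i < OPS_PER_THREAD; i++) map.put(base + i, i);
            }});
            pool.execute(new Runnable() { public void run() {      // reader
                awaitQuietly(start);
                for (long i = 0; i < OPS_PER_THREAD; i++) map.get(base + i);
            }});
            pool.execute(new Runnable() { public void run() {      // deleter
                awaitQuietly(start);
                for (long i = 0; i < OPS_PER_THREAD; i++) map.remove(base + i);
            }});
        }
        long begin = System.nanoTime();
        start.countDown();                    // release all threads at once
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
    }

    static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        // Alternate the two maps over several rounds in the same JVM so that
        // HotSpot warm-up is not charged to whichever map happens to run first.
        for (int round = 0; round < 3; round++) {
            long cslm = run(new ConcurrentSkipListMap<Long, Long>());
            long sync = run(Collections.synchronizedSortedMap(new TreeMap<Long, Long>()));
            System.out.println("round " + round + ": CSLM " + cslm
                    + " ms, synchronized TreeMap " + sync + " ms");
        }
    }
}

The Thread.yield() suggestion above would go inside the per-operation loops, and OPS_PER_THREAD is a single constant so it can be scaled up by an order of magnitude as suggested.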
> > > On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
> > > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
> > > > I would think that the bottleneck for insert is the wal part?
> > > > It would be possible to do a kind of memory-list preparation during the wal insertion, and if the wal insertion is confirmed, do the insertion in the memory list. But it's strange for the in-memory insertion to matter compared with the insertion on disk...
> > >
> > > Yes, WAL is the long pole writing. But MemStore has issues too; Dhruba says cpu above. Reading and writing it is also 'slow'.
> > > St.Ack
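A minimal sketch of the ordering suggested above: do any preparation while the WAL append/sync is in flight, and publish the edit into the memstore map only after the sync is confirmed. The Wal interface, its appendAndSync method, and the KeyValue class here are stand-ins invented for the example, not HBase APIs.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

public class WalThenMemstoreSketch {

    // Stand-in for the real write-ahead log; appendAndSync is a made-up method.
    interface Wal {
        Future<Void> appendAndSync(List<KeyValue> edits);
    }

    // Stand-in key/value pair, not the HBase KeyValue class.
    static class KeyValue {
        final String key;
        final byte[] value;
        KeyValue(String key, byte[] value) { this.key = key; this.value = value; }
    }

    private final ConcurrentNavigableMap<String, byte[]> memstore =
            new ConcurrentSkipListMap<String, byte[]>();

    public void write(Wal wal, List<KeyValue> edits) throws Exception {
        // 1. Kick off the durable append; this is the long pole of the write.
        Future<Void> synced = wal.appendAndSync(edits);
        // 2. Preparation that does not publish data to readers could overlap with
        //    the sync here (sizing, encoding, building entries off to the side).
        // 3. Only once the sync has succeeded do the edits become visible.
        synced.get();
        for (KeyValue kv : edits) {
            memstore.put(kv.key, kv.value);
        }
    }

    public static void main(String[] args) throws Exception {
        // Dummy WAL that completes immediately, just to exercise the ordering.
        Wal wal = new Wal() {
            public Future<Void> appendAndSync(List<KeyValue> edits) {
                FutureTask<Void> done = new FutureTask<Void>(new Runnable() {
                    public void run() { /* pretend to append and sync */ }
                }, null);
                done.run();
                return done;
            }
        };
        WalThenMemstoreSketch store = new WalThenMemstoreSketch();
        store.write(wal, Arrays.asList(new KeyValue("row1", new byte[] { 1 })));
        System.out.println("memstore entries: " + store.memstore.size()); // prints 1
    }
}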