On Mon, Oct 24, 2011 at 4:50 AM, Jonathan Gray <jg...@fb.com> wrote:

> Very nice experiment, Akash. Keep getting your hands dirty and digging! :)
>
> I think your results might change if you bump the test up to 1000 threads or so. 100 threads can still perform okay when there's a global lock, but the contention at 1000 threads will kill you, and that's when CSLM should do much better. (1000 handler threads is approx. what I run with on RS in prod.) Though I am a bit surprised that at 100 threads the TreeMap was significantly faster. Your inconsistent results are a bit odd; you might try an order of magnitude more operations per thread. You might also gather some statistics about tree size and per-operation latency.
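For concreteness, here is a minimal sketch of the kind of contention micro-benchmark being discussed: a globally synchronized TreeMap and a ConcurrentSkipListMap hammered with the same mixed read/write/delete load. The thread and operation counts are placeholders, and this is not the test file attached later in the thread.

import java.util.Collections;
import java.util.Random;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CountDownLatch;

public class MapContentionBench {
    // Placeholder sizes; bump THREADS to 1000+ to see the effect of lock contention.
    static final int THREADS = 1000;
    static final int OPS_PER_THREAD = 10000;
    static final int KEY_SPACE = 1000000;

    public static void main(String[] args) throws InterruptedException {
        System.out.println("synchronized TreeMap:  "
            + run(Collections.synchronizedSortedMap(new TreeMap<Long, Long>())) + " ms");
        System.out.println("ConcurrentSkipListMap: "
            + run(new ConcurrentSkipListMap<Long, Long>()) + " ms");
    }

    static long run(final SortedMap<Long, Long> map) throws InterruptedException {
        final CountDownLatch start = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(THREADS);
        for (int t = 0; t < THREADS; t++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        start.await();
                        Random rnd = new Random();
                        for (int i = 0; i < OPS_PER_THREAD; i++) {
                            Long k = (long) rnd.nextInt(KEY_SPACE);
                            if (i % 3 == 0) {
                                map.put(k, k);      // writer
                            } else if (i % 3 == 1) {
                                map.get(k);         // reader
                            } else {
                                map.remove(k);      // deleter
                            }
                        }
                    } catch (InterruptedException ignored) {
                    } finally {
                        done.countDown();
                    }
                }
            }).start();
        }
        long begin = System.nanoTime();
        start.countDown();   // release all threads at once
        done.await();        // wait for them all to finish
        return (System.nanoTime() - begin) / 1000000L;
    }
}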
Hey JG. At 1000 threads the situation reversed; CSLM performed much better, as you said. I took it a little overboard just to see what happens at 10,000 threads: the globally synchronized TreeMap again performed better, consistently, which was super duper surprising. I am trying to figure out why this is happening and will post the results soon.

> I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest. I recall trying some impractically large MemStores and everything still being quite fast.
>
> Over in Cassandra, I believe they have a two-level CSLM, with the first map key being the row and then the columns for each row in their own CSLM. I've been told this is somewhat of a pain point for them. And keep in mind they have one shard/region per node, and we generally have several smaller MemStores on each node (tens to thousands). Not sure we would want to try that. There could be some interesting optimizations if you had very specific issues; for example, if you had a ton of reads to MemStore and not many writes, you could keep some kind of mirrored hashmap.
>
> And for writes, the WAL is definitely the latency bottleneck. But if you are doing lots of small operations, so your WALEdits are not large, and with some of the HLog batching features going into trunk, you end up with hundreds of requests per HLog sync. And although the syncs are higher latency, with batching you end up getting high throughput. And the bottleneck shifts.
>
> Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch at 4ms per sync, which gives 62.5k req/sec (62.5k req/sec * 100 bytes/req is roughly 6MB/sec, very reasonable). If you're mixing in reads as well (or if you're doing increments, which do a read and a write), then this adds to the CPU usage and contention without adding to HLog throughput.
>
> All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput. Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU-bottlenecked.
>
> For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM: doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :) Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they haven't all made their way to Apache.
>
> In the end, after our optimizations, the real-world workload Dhruba and I were working with was not all in-memory, so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.
>
> JG
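To illustrate the tailMap/iterator pattern JG describes above, here is a rough hypothetical sketch. The key type, range bounds and removal predicate are invented for illustration; this is not the actual patch.

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class TailMapSweep {
    public static void main(String[] args) {
        ConcurrentSkipListMap<Long, String> memstore = new ConcurrentSkipListMap<Long, String>();
        for (long k = 0; k < 1000; k++) {
            memstore.put(k, "v" + k);
        }

        // Position once with tailMap() and walk forward with the iterator,
        // instead of issuing an independent lookup from the root for every key.
        long startKey = 500L;
        Iterator<Map.Entry<Long, String>> it = memstore.tailMap(startKey).entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, String> e = it.next();
            if (e.getKey() > 600L) {
                break;                  // past the range we care about
            }
            if (e.getKey() % 2 == 0) {  // some per-entry decision (made up for illustration)
                it.remove();            // drop the entry just visited instead of a separate get() + remove()
            }
        }
        System.out.println("remaining entries: " + memstore.size());
    }
}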
> From: Akash Ashok [mailto:thehellma...@gmail.com]
> Sent: Sunday, October 23, 2011 2:57 AM
> To: dev@hbase.apache.org
> Subject: Re: SILT - nice keyvalue store paper
>
> I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the MemStore uses) and a synchronized TreeMap (literally a TreeMap wrapped in synchronization), executing concurrent reads, writes and deletes on both of them. Surprisingly, the synchronized TreeMap performed better, though only slightly better, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
>
> Here is the output of a few runs.
>
> Sometimes the difference was considerable:
> Using HBaseMap it took 20438ms
> Using TreeMap it took 11613ms
> Time Difference: 8825ms
>
> And sometimes the difference was negligible:
> Using HBaseMap it took 13370ms
> Using TreeMap it took 9482ms
> Time Difference: 3888ms
>
> I'm attaching the test java file which I wrote to test it. This might be a very minor difference, but it is still surprising considering the fact that ConcurrentSkipListMap uses fancy 2-level indexes, which they say improve deletion performance.
>
> And here are the details about the test run:
> 100 threads each fetching 1,000,000 records
> 100 threads each adding 1,000,000 records
> 100 threads each deleting 1,000,000 records
> (reads, writes and deletes simultaneously)
>
> Cheers,
> Akash A
>
> On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
> > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
> > > I would think that the bottleneck for insert is the WAL part? It would be possible to do a kind of memory list preparation during the WAL insertion and, if the WAL insertion is confirmed, do the insertion in the memory list. But it's strange to have the in-memory insertion matter as much as the insertion on disk...
> >
> > Yes, WAL is the long pole writing. But MemStore has issues too; Dhruba says CPU above. Reading and writing it is also 'slow'.
> > St.Ack
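As a final illustration, here is a hypothetical sketch of the "prepare the memory insertion during the WAL append, apply it once the WAL insertion is confirmed" idea floated above. None of this is actual HBase code; the class and method names are invented.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

public class PreparedMemstoreWrite {
    private final ConcurrentSkipListMap<String, String> memstore;
    private final List<String[]> staged = new ArrayList<String[]>();

    public PreparedMemstoreWrite(ConcurrentSkipListMap<String, String> memstore) {
        this.memstore = memstore;
    }

    // While the WAL append/sync is still in flight: stage the edit, don't publish it to readers.
    public void prepare(String key, String value) {
        staged.add(new String[] { key, value });
    }

    // Only after the WAL sync is acknowledged: publish the staged edits to the in-memory map.
    public void commit() {
        for (String[] kv : staged) {
            memstore.put(kv[0], kv[1]);
        }
        staged.clear();
    }

    // If the WAL write fails: drop the staged edits; the memstore never saw them.
    public void abort() {
        staged.clear();
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> memstore = new ConcurrentSkipListMap<String, String>();
        PreparedMemstoreWrite write = new PreparedMemstoreWrite(memstore);
        write.prepare("row1/cf:q1", "value1");   // the WAL append would happen around here
        write.commit();                          // ... and this only runs once the sync returns
        System.out.println(memstore);
    }
}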