Also, please make a note of which JVM version you are running and report it with the results.
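For example, something along these lines could be printed at the top of the benchmark output (a minimal sketch; the property keys are standard JVM system properties, and the class name is made up):

    // Print the JVM details so they can be reported alongside the benchmark numbers.
    public class JvmInfo {
        public static void main(String[] args) {
            System.out.println("java.version    = " + System.getProperty("java.version"));
            System.out.println("java.vm.name    = " + System.getProperty("java.vm.name"));
            System.out.println("java.vm.version = " + System.getProperty("java.vm.version"));
        }
    }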
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

----- Original Message -----
> From: Li Pi <l...@idle.li>
> To: dev@hbase.apache.org
> Cc:
> Sent: Monday, October 24, 2011 11:27 PM
> Subject: Re: CSLM performance Was: SILT - nice keyvalue store paper
>
> You might also want to run the code a few times, and only take results
> from the latter half. Let the JVM warm up and JIT things for a fair
> comparison.
>
> On Mon, Oct 24, 2011 at 11:14 PM, Akash Ashok <thehellma...@gmail.com> wrote:
>> On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Akash:
>>> Take a look at the following two possible replacements for CSLM:
>>> https://github.com/mspiegel/lockfreeskiptree
>>> https://github.com/nbronson/snaptree
>>
>> Thanks Ted :) :) :). I was pretty much on the lookout for other structures. I tried
>> http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html
>> but it didn't perform that well. I will look into these replacements.
>>
>>> About your test, I got advice from another expert:
>>>
>>> - The JVM warm-up penalty is being accrued by the CSLM run.
>>> - Use Thread.yield() to end a request, as otherwise the active thread may
>>>   be able to run through its time slice without incurring much lock contention.
>>
>> Hmm, I was just wondering: since each thread runs 10000+ inserts and deletes,
>> it has to context-switch quite a few times during its lifecycle, right?
>> Wouldn't it be enough if I increased the number of iterations per thread by a
>> factor of, say, 100?
>>
>>> You can publish your code and results under HBASE-3992.
>>>
>>> Thanks
>>>
>>> On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:
>>>
>>>> Oh, and when running these experiments, you should look at the impact of
>>>> the order they are run in, whether you run them multiple times per JVM
>>>> instance, etc. Basically, you need to be cognizant of the HotSpot
>>>> optimizations the JVM is doing at runtime.
>>>>
>>>>> -----Original Message-----
>>>>> From: Jonathan Gray [mailto:jg...@fb.com]
>>>>> Sent: Sunday, October 23, 2011 4:20 PM
>>>>> To: dev@hbase.apache.org
>>>>> Subject: RE: SILT - nice keyvalue store paper
>>>>>
>>>>> Very nice experiment, Akash. Keep getting your hands dirty and digging! :)
>>>>>
>>>>> I think your results might change if you bump the test up to 1000 threads
>>>>> or so. 100 threads can still perform okay when there's a global lock, but
>>>>> the contention at 1000 threads will kill you, and that's when CSLM should
>>>>> do much better. (1000 handler threads is approximately what I run with on
>>>>> a RS in prod.) Though I am a bit surprised that at 100 threads the TreeMap
>>>>> was significantly faster. Your inconsistent results are a bit odd; you
>>>>> might try an order of magnitude more operations per thread. You might also
>>>>> gather some statistics about tree size and per-operation latency.
>>>>>
>>>>> I've done some isolated CSLM benchmarks in the past and have never been
>>>>> able to reproduce any of the slowness people suggest. I recall trying some
>>>>> impractically large MemStores and everything still being quite fast.
>>>>>
>>>>> Over in Cassandra, I believe they have a two-level CSLM, with the first
>>>>> map key being the row and then the columns for each row in their own CSLM.
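As a rough illustration of the two-level layout described just above (only a sketch with made-up names and String keys to stay self-contained, not Cassandra's or HBase's actual code):

    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Sketch of a two-level skip list: an outer map keyed by row,
    // with each row's columns kept in their own ConcurrentSkipListMap.
    public class TwoLevelSkipListSketch {
        private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, byte[]>> rows =
            new ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, byte[]>>();

        // Insert one column value, creating the per-row map lazily.
        public void put(String row, String column, byte[] value) {
            ConcurrentSkipListMap<String, byte[]> cols = rows.get(row);
            if (cols == null) {
                ConcurrentSkipListMap<String, byte[]> fresh =
                    new ConcurrentSkipListMap<String, byte[]>();
                ConcurrentSkipListMap<String, byte[]> prev = rows.putIfAbsent(row, fresh);
                cols = (prev != null) ? prev : fresh;  // another thread may have won the race
            }
            cols.put(column, value);
        }

        // Columns of one row in [startCol, stopCol), as a live concurrent view.
        public ConcurrentNavigableMap<String, byte[]> scanRow(String row, String startCol, String stopCol) {
            ConcurrentSkipListMap<String, byte[]> cols = rows.get(row);
            return (cols == null) ? null : cols.subMap(startCol, true, stopCol, false);
        }
    }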
>>>>> I've been told this is somewhat of a pain point for them. And keep in
>>>>> mind they have one shard/region per node, while we generally have several
>>>>> smaller MemStores on each node (tens to thousands). I'm not sure we would
>>>>> want to try that. There could be some interesting optimizations if you had
>>>>> very specific issues; for example, if you had a ton of reads to MemStore
>>>>> and not many writes, you could keep some kind of mirrored hashmap.
>>>>>
>>>>> And for writes, the WAL is definitely the latency bottleneck. But if you
>>>>> are doing lots of small operations, so your WALEdits are not large, and
>>>>> with some of the HLog batching features going into trunk, you end up with
>>>>> hundreds of requests per HLog sync. And although the syncs are higher
>>>>> latency, with batching you end up getting high throughput. And the
>>>>> bottleneck shifts.
>>>>>
>>>>> Each sync will take approx. 1-5ms, so let's say 250 requests per HLog
>>>>> sync batch at 4ms per sync: that's 62.5k req/sec (62.5k req/sec * 100
>>>>> bytes/req = ~6MB/sec, very reasonable). If you're mixing in reads as well
>>>>> (or if you're doing increments, which do a read and a write), then this
>>>>> adds to the CPU usage and contention without adding to HLog throughput.
>>>>>
>>>>> All of a sudden the bottleneck becomes CPU/contention and not HLog
>>>>> latency or throughput. Highly concurrent increments/counters with a
>>>>> largely in-memory dataset can easily be CPU bottlenecked.
>>>>>
>>>>> For one specific application Dhruba and I worked on, we made some good
>>>>> improvements in CPU efficiency by reducing the number of operations and
>>>>> increasing efficiency on the CSLM. Doing things like always taking a
>>>>> tailMap and working from that instead of starting at the root node, using
>>>>> an iterator() and taking advantage of the available remove() semantics, or
>>>>> simply just mutating things that are normally immutable :) Unfortunately
>>>>> many of these optimizations were semi-horrid hacks and introduced things
>>>>> like ModifiableKeyValues, so they haven't all made their way to Apache.
>>>>>
>>>>> In the end, after our optimizations, the real-world workload Dhruba and I
>>>>> were working with was not all in-memory, so the bottleneck in production
>>>>> became the random reads (so increasing the block cache hit ratio is the
>>>>> focus) rather than CPU contention or HLog throughput.
>>>>>
>>>>> JG
>>>>>
>>>>> From: Akash Ashok [mailto:thehellma...@gmail.com]
>>>>> Sent: Sunday, October 23, 2011 2:57 AM
>>>>> To: dev@hbase.apache.org
>>>>> Subject: Re: SILT - nice keyvalue store paper
>>>>>
>>>>> I was running some similar tests and came across a surprising finding. I
>>>>> compared reads and writes on ConcurrentSkipListMap (which the memstore
>>>>> uses) and a synchronized TreeMap (which was literally a TreeMap wrapped in
>>>>> synchronization), and executed concurrent reads, writes and deletes on
>>>>> both of them. Surprisingly, the synchronized TreeMap performed better,
>>>>> though only slightly better, than the ConcurrentSkipListMap which
>>>>> KeyValueSkipListSet uses.
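For reference, the two structures being compared can be constructed as below; this is a minimal sketch with hypothetical key/value types, and the attached test file remains the authoritative version. The TreeMap variant serializes every call behind a single lock, while the skip list allows lock-free concurrent access:

    import java.util.Collections;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class MapsUnderTest {
        // "Literally a TreeMap synchronized": one global lock around every operation.
        static final SortedMap<Long, byte[]> syncTreeMap =
            Collections.synchronizedSortedMap(new TreeMap<Long, byte[]>());

        // What the MemStore uses underneath KeyValueSkipListSet.
        static final ConcurrentNavigableMap<Long, byte[]> cslm =
            new ConcurrentSkipListMap<Long, byte[]>();
    }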
>>>>> Here is the output of a few runs.
>>>>>
>>>>> Sometimes the difference was considerable:
>>>>>   Using HBaseMap it took 20438ms
>>>>>   Using TreeMap it took 11613ms
>>>>>   Time Difference: 8825ms
>>>>>
>>>>> And sometimes the difference was negligible:
>>>>>   Using HBaseMap it took 13370ms
>>>>>   Using TreeMap it took 9482ms
>>>>>   Time Difference: 3888ms
>>>>>
>>>>> I'm attaching the Java test file which I wrote to test it.
>>>>> This might be a very minor difference, but it is still surprising
>>>>> considering the fact that ConcurrentSkipListMap uses fancy two-level
>>>>> indexes which they say improve deletion performance.
>>>>>
>>>>> And here are the details of the test run:
>>>>>   100 threads each fetching 1,000,000 records
>>>>>   100 threads each adding 1,000,000 records
>>>>>   100 threads each deleting 1,000,000 records
>>>>>   (reads, writes and deletes run simultaneously)
>>>>>
>>>>> Cheers,
>>>>> Akash A
>>>>>
>>>>> On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
>>>>> On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
>>>>>> I would think that the bottleneck for insert is the WAL part?
>>>>>> It would be possible to do a kind of memory list preparation during
>>>>>> the WAL insertion, and if the WAL insertion is confirmed, do the
>>>>>> insertion in the memory list. But it's strange to have the insertion in
>>>>>> memory matter so much compared to the insertion on disk...
>>>>>
>>>>> Yes, the WAL is the long pole when writing. But MemStore has issues too;
>>>>> Dhruba says CPU, above. Reading and writing it is also 'slow'.
>>>>> St.Ack
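Pulling together the advice upthread (let the JVM warm up and discard the early rounds, use more operations per thread, and Thread.yield() at the end of each simulated request), a harness along these lines could be used. This is only a rough sketch with made-up class and constant names, not the test file attached above:

    import java.util.Map;
    import java.util.Random;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    // Sketch of a warm-up-aware benchmark: early rounds are labeled warm-up so the
    // JIT has settled before the rounds that count, and each worker yields after
    // every simulated request so a thread cannot burn its time slice uncontended.
    public class CslmBenchSketch {
        static final int THREADS = 100;
        static final int OPS_PER_THREAD = 1000000;  // scale down for a quick run

        static long runOnce(final Map<Long, byte[]> map) throws InterruptedException {
            final CountDownLatch start = new CountDownLatch(1);
            final CountDownLatch done = new CountDownLatch(THREADS);
            for (int t = 0; t < THREADS; t++) {
                new Thread(new Runnable() {
                    public void run() {
                        Random rnd = new Random();
                        try {
                            start.await();
                            for (int i = 0; i < OPS_PER_THREAD; i++) {
                                long key = rnd.nextLong();
                                map.put(key, new byte[0]);  // write
                                map.get(key);               // read
                                map.remove(key);            // delete
                                Thread.yield();             // end of one simulated request
                            }
                        } catch (InterruptedException ie) {
                            Thread.currentThread().interrupt();
                        } finally {
                            done.countDown();
                        }
                    }
                }).start();
            }
            long begin = System.nanoTime();
            start.countDown();   // release all workers at once
            done.await();
            return (System.nanoTime() - begin) / 1000000;  // elapsed ms
        }

        public static void main(String[] args) throws InterruptedException {
            Map<Long, byte[]> map = new ConcurrentSkipListMap<Long, byte[]>();
            for (int round = 0; round < 6; round++) {
                long ms = runOnce(map);
                // Only the later rounds are meaningful once the JVM has warmed up.
                System.out.println("round " + round + ": " + ms + "ms"
                    + (round < 3 ? " (warm-up)" : ""));
                map.clear();
            }
        }
    }

The same harness can be pointed at the synchronized TreeMap by swapping the map passed into runOnce(), so both structures see an identical workload and identical warm-up treatment.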