Also, please make a note of which JVM version you are running and report it with the results.
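For example, something along these lines could be printed at the top of the benchmark output (a minimal sketch; the property keys are standard JVM system properties, and the class name is made up):

    // Print the JVM details so they can be reported alongside the benchmark numbers.
    public class JvmInfo {
        public static void main(String[] args) {
            System.out.println("java.version    = " + System.getProperty("java.version"));
            System.out.println("java.vm.name    = " + System.getProperty("java.vm.name"));
            System.out.println("java.vm.version = " + System.getProperty("java.vm.version"));
        }
    }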
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

----- Original Message -----
> From: Li Pi <l...@idle.li>
> To: dev@hbase.apache.org
> Cc:
> Sent: Monday, October 24, 2011 11:27 PM
> Subject: Re: CSLM performance Was: SILT - nice keyvalue store paper
>
> You might also want to run the code a few times, and only take results
> from the latter half. Let the JVM warm up and JIT things for a fair
> comparison.
>
> On Mon, Oct 24, 2011 at 11:14 PM, Akash Ashok <thehellma...@gmail.com> wrote:
>> On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Akash:
>>> Take a look at the following two possible replacements for CSLM:
>>> https://github.com/mspiegel/lockfreeskiptree
>>> https://github.com/nbronson/snaptree
>>
>> Thanks Ted :) :) :). I was pretty much on the lookout for other structures. I tried
>> http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html
>> but it didn't perform that well. I will look into these replacements.
>>
>>> About your test, I got advice from another expert:
>>>
>>> - The JVM warm-up penalty is being accrued by the CSLM run.
>>> - Use Thread.yield() to end a request, as otherwise the active thread may
>>>   be able to run through its time slice without incurring much lock contention.
>>
>> Hmm, I was just wondering: since each thread runs 10000+ inserts and deletes,
>> it has to context-switch quite a few times during its lifecycle, right?
>> Wouldn't it be enough if I increased the number of iterations per thread by a
>> factor of, say, 100?
>>
>>> You can publish your code and results under HBASE-3992.
>>>
>>> Thanks
>>>
>>> On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:
>>>
>>>> Oh, and when running these experiments, you should look at the impact of
>>>> the order they are run in, whether you run them multiple times per JVM
>>>> instance, etc. Basically, you need to be cognizant of the HotSpot
>>>> optimizations the JVM is doing at runtime.
>>>>
>>>>> -----Original Message-----
>>>>> From: Jonathan Gray [mailto:jg...@fb.com]
>>>>> Sent: Sunday, October 23, 2011 4:20 PM
>>>>> To: dev@hbase.apache.org
>>>>> Subject: RE: SILT - nice keyvalue store paper
>>>>>
>>>>> Very nice experiment, Akash. Keep getting your hands dirty and digging! :)
>>>>>
>>>>> I think your results might change if you bump the test up to 1000 threads
>>>>> or so. 100 threads can still perform okay when there's a global lock, but
>>>>> the contention at 1000 threads will kill you, and that's when CSLM should
>>>>> do much better. (1000 handler threads is approximately what I run with on
>>>>> a RS in prod.) Though I am a bit surprised that at 100 threads the TreeMap
>>>>> was significantly faster. Your inconsistent results are a bit odd; you
>>>>> might try an order of magnitude more operations per thread. You might also
>>>>> gather some statistics about tree size and per-operation latency.
>>>>>
>>>>> I've done some isolated CSLM benchmarks in the past and have never been
>>>>> able to reproduce any of the slowness people suggest. I recall trying some
>>>>> impractically large MemStores and everything still being quite fast.
>>>>>
>>>>> Over in Cassandra, I believe they have a two-level CSLM, with the first
>>>>> map key being the row and then the columns for each row in their own CSLM.
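As a rough illustration of the two-level layout described just above (only a sketch with made-up names and String keys to stay self-contained, not Cassandra's or HBase's actual code):

    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Sketch of a two-level skip list: an outer map keyed by row,
    // with each row's columns kept in their own ConcurrentSkipListMap.
    public class TwoLevelSkipListSketch {
        private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, byte[]>> rows =
            new ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, byte[]>>();

        // Insert one column value, creating the per-row map lazily.
        public void put(String row, String column, byte[] value) {
            ConcurrentSkipListMap<String, byte[]> cols = rows.get(row);
            if (cols == null) {
                ConcurrentSkipListMap<String, byte[]> fresh =
                    new ConcurrentSkipListMap<String, byte[]>();
                ConcurrentSkipListMap<String, byte[]> prev = rows.putIfAbsent(row, fresh);
                cols = (prev != null) ? prev : fresh;  // another thread may have won the race
            }
            cols.put(column, value);
        }

        // Columns of one row in [startCol, stopCol), as a live concurrent view.
        public ConcurrentNavigableMap<String, byte[]> scanRow(String row, String startCol, String stopCol) {
            ConcurrentSkipListMap<String, byte[]> cols = rows.get(row);
            return (cols == null) ? null : cols.subMap(startCol, true, stopCol, false);
        }
    }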
>>>>> I've been told this is somewhat of a pain point for them. And keep in
>>>>> mind they have one shard/region per node, while we generally have several
>>>>> smaller MemStores on each node (tens to thousands). I'm not sure we would
>>>>> want to try that. There could be some interesting optimizations if you had
>>>>> very specific issues; for example, if you had a ton of reads to MemStore
>>>>> and not many writes, you could keep some kind of mirrored hashmap.
>>>>>
>>>>> And for writes, the WAL is definitely the latency bottleneck. But if you
>>>>> are doing lots of small operations, so your WALEdits are not large, and
>>>>> with some of the HLog batching features going into trunk, you end up with
>>>>> hundreds of requests per HLog sync. And although the syncs are higher
>>>>> latency, with batching you end up getting high throughput. And the
>>>>> bottleneck shifts.
>>>>>
>>>>> Each sync will take approx. 1-5ms, so let's say 250 requests per HLog
>>>>> sync batch at 4ms per sync: that's 62.5k req/sec (62.5k req/sec * 100
>>>>> bytes/req = ~6MB/sec, very reasonable). If you're mixing in reads as well
>>>>> (or if you're doing increments, which do a read and a write), then this
>>>>> adds to the CPU usage and contention without adding to HLog throughput.
>>>>>
>>>>> All of a sudden the bottleneck becomes CPU/contention and not HLog
>>>>> latency or throughput. Highly concurrent increments/counters with a
>>>>> largely in-memory dataset can easily be CPU bottlenecked.
>>>>>
>>>>> For one specific application Dhruba and I worked on, we made some good
>>>>> improvements in CPU efficiency by reducing the number of operations and
>>>>> increasing efficiency on the CSLM. Doing things like always taking a
>>>>> tailMap and working from that instead of starting at the root node, using
>>>>> an iterator() and taking advantage of the available remove() semantics, or
>>>>> simply just mutating things that are normally immutable :) Unfortunately
>>>>> many of these optimizations were semi-horrid hacks and introduced things
>>>>> like ModifiableKeyValues, so they haven't all made their way to Apache.
>>>>>
>>>>> In the end, after our optimizations, the real-world workload Dhruba and I
>>>>> were working with was not all in-memory, so the bottleneck in production
>>>>> became the random reads (so increasing the block cache hit ratio is the
>>>>> focus) rather than CPU contention or HLog throughput.
>>>>>
>>>>> JG
>>>>>
>>>>> From: Akash Ashok [mailto:thehellma...@gmail.com]
>>>>> Sent: Sunday, October 23, 2011 2:57 AM
>>>>> To: dev@hbase.apache.org
>>>>> Subject: Re: SILT - nice keyvalue store paper
>>>>>
>>>>> I was running some similar tests and came across a surprising finding. I
>>>>> compared reads and writes on ConcurrentSkipListMap (which the memstore
>>>>> uses) and a synchronized TreeMap (which was literally a TreeMap wrapped in
>>>>> synchronization), and executed concurrent reads, writes and deletes on
>>>>> both of them. Surprisingly, the synchronized TreeMap performed better,
>>>>> though only slightly better, than the ConcurrentSkipListMap which
>>>>> KeyValueSkipListSet uses.
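For reference, the two structures being compared can be constructed as below; this is a minimal sketch with hypothetical key/value types, and the attached test file remains the authoritative version. The TreeMap variant serializes every call behind a single lock, while the skip list allows lock-free concurrent access:

    import java.util.Collections;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class MapsUnderTest {
        // "Literally a TreeMap synchronized": one global lock around every operation.
        static final SortedMap<Long, byte[]> syncTreeMap =
            Collections.synchronizedSortedMap(new TreeMap<Long, byte[]>());

        // What the MemStore uses underneath KeyValueSkipListSet.
        static final ConcurrentNavigableMap<Long, byte[]> cslm =
            new ConcurrentSkipListMap<Long, byte[]>();
    }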
>>>>> Here is the output of a few runs.
>>>>>
>>>>> Sometimes the difference was considerable:
>>>>>   Using HBaseMap it took 20438ms
>>>>>   Using TreeMap it took 11613ms
>>>>>   Time Difference: 8825ms
>>>>>
>>>>> And sometimes the difference was negligible:
>>>>>   Using HBaseMap it took 13370ms
>>>>>   Using TreeMap it took 9482ms
>>>>>   Time Difference: 3888ms
>>>>>
>>>>> I'm attaching the Java test file which I wrote to test it.
>>>>> This might be a very minor difference, but it is still surprising
>>>>> considering the fact that ConcurrentSkipListMap uses fancy two-level
>>>>> indexes which they say improve deletion performance.
>>>>>
>>>>> And here are the details of the test run:
>>>>>   100 threads each fetching 1,000,000 records
>>>>>   100 threads each adding 1,000,000 records
>>>>>   100 threads each deleting 1,000,000 records
>>>>>   (reads, writes and deletes run simultaneously)
>>>>>
>>>>> Cheers,
>>>>> Akash A
>>>>>
>>>>> On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
>>>>> On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
>>>>>> I would think that the bottleneck for insert is the WAL part?
>>>>>> It would be possible to do a kind of memory list preparation during
>>>>>> the WAL insertion, and if the WAL insertion is confirmed, do the
>>>>>> insertion in the memory list. But it's strange to have the insertion in
>>>>>> memory matter so much compared to the insertion on disk...
>>>>>
>>>>> Yes, the WAL is the long pole when writing. But MemStore has issues too;
>>>>> Dhruba says CPU, above. Reading and writing it is also 'slow'.
>>>>> St.Ack
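Pulling together the advice upthread (let the JVM warm up and discard the early rounds, use more operations per thread, and Thread.yield() at the end of each simulated request), a harness along these lines could be used. This is only a rough sketch with made-up class and constant names, not the test file attached above:

    import java.util.Map;
    import java.util.Random;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    // Sketch of a warm-up-aware benchmark: early rounds are labeled warm-up so the
    // JIT has settled before the rounds that count, and each worker yields after
    // every simulated request so a thread cannot burn its time slice uncontended.
    public class CslmBenchSketch {
        static final int THREADS = 100;
        static final int OPS_PER_THREAD = 1000000;  // scale down for a quick run

        static long runOnce(final Map<Long, byte[]> map) throws InterruptedException {
            final CountDownLatch start = new CountDownLatch(1);
            final CountDownLatch done = new CountDownLatch(THREADS);
            for (int t = 0; t < THREADS; t++) {
                new Thread(new Runnable() {
                    public void run() {
                        Random rnd = new Random();
                        try {
                            start.await();
                            for (int i = 0; i < OPS_PER_THREAD; i++) {
                                long key = rnd.nextLong();
                                map.put(key, new byte[0]);  // write
                                map.get(key);               // read
                                map.remove(key);            // delete
                                Thread.yield();             // end of one simulated request
                            }
                        } catch (InterruptedException ie) {
                            Thread.currentThread().interrupt();
                        } finally {
                            done.countDown();
                        }
                    }
                }).start();
            }
            long begin = System.nanoTime();
            start.countDown();   // release all workers at once
            done.await();
            return (System.nanoTime() - begin) / 1000000;  // elapsed ms
        }

        public static void main(String[] args) throws InterruptedException {
            Map<Long, byte[]> map = new ConcurrentSkipListMap<Long, byte[]>();
            for (int round = 0; round < 6; round++) {
                long ms = runOnce(map);
                // Only the later rounds are meaningful once the JVM has warmed up.
                System.out.println("round " + round + ": " + ms + "ms"
                    + (round < 3 ? " (warm-up)" : ""));
                map.clear();
            }
        }
    }

The same harness can be pointed at the synchronized TreeMap by swapping the map passed into runOnce(), so both structures see an identical workload and identical warm-up treatment.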