On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Akash:
> Take a look at the following two possible replacements for CSLM:
> https://github.com/mspiegel/lockfreeskiptree
> https://github.com/nbronson/snaptree

Thanks Ted :) :) :). I was pretty much on the lookout for other structures. I tried http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html but it didn't perform that well. Will look into these replacements.

> About your test, I got advice from another expert:
>
> - the JVM warm-up penalty is being accrued by the CSLM run
> - use Thread.yield() to end a request, as otherwise the active thread may be able to run through its time slice without incurring much lock contention

Hmm, I was just wondering: since each thread runs 10,000+ inserts and deletes, it has to context switch quite a few times during its lifecycle, right? Wouldn't it be enough if I increased the number of iterations per thread by a factor of, say, 100?

> You can publish your code and results under HBASE-3992.
>
> Thanks
>
> On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:
>
> > Oh, and when running these experiments, you should look at the impact of the order they are run in, whether you run them multiple times per JVM instance, etc. Basically, you need to be cognizant of the HotSpot optimizations the JVM is doing at runtime.
> >
> > > -----Original Message-----
> > > From: Jonathan Gray [mailto:jg...@fb.com]
> > > Sent: Sunday, October 23, 2011 4:20 PM
> > > To: dev@hbase.apache.org
> > > Subject: RE: SILT - nice keyvalue store paper
> > >
> > > Very nice experiment, Akash. Keep getting your hands dirty and digging! :)
> > >
> > > I think your results might change if you bump the test up to 1000 threads or so. 100 threads can still perform okay when there's a global lock, but the contention at 1000 threads will kill you, and that's when CSLM should do much better. (1000 handler threads is approx. what I run with on RS in prod.) Though I am a bit surprised that at 100 threads the TreeMap was significantly faster. Your inconsistent results are a bit odd; you might try an order of magnitude more operations per thread. You might also gather some statistics about tree size and per-operation latency.
> > >
> > > I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest. I recall trying some impractically large MemStores and everything still being quite fast.
> > >
> > > Over in Cassandra, I believe they have a two-level CSLM with the first map key being the row and then the columns for each row in their own CSLM. I've been told this is somewhat of a pain point for them. And keep in mind they have one shard/region per node and we generally have several smaller MemStores on each node (tens to thousands). Not sure we would want to try that. There could be some interesting optimizations if you had very specific issues; for example, if you had a ton of reads to MemStore and not many writes, you could keep some kind of mirrored hashmap.
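For concreteness, here is a minimal sketch of the two-level layout described above: an outer ConcurrentSkipListMap keyed by row, with each row holding its own inner CSLM of columns. String keys are used only to keep the example self-contained; this is illustrative, not Cassandra or HBase code, and the class and method names are made up.

import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class TwoLevelMapSketch {

    // Outer CSLM keyed by row; each value is an inner CSLM of columns for that row.
    private final ConcurrentNavigableMap<String, ConcurrentNavigableMap<String, byte[]>> rows =
            new ConcurrentSkipListMap<String, ConcurrentNavigableMap<String, byte[]>>();

    public void put(String row, String column, byte[] value) {
        ConcurrentNavigableMap<String, byte[]> columns = rows.get(row);
        if (columns == null) {
            // putIfAbsent keeps concurrent row creation race-free without a lock.
            ConcurrentNavigableMap<String, byte[]> fresh =
                    new ConcurrentSkipListMap<String, byte[]>();
            ConcurrentNavigableMap<String, byte[]> prev = rows.putIfAbsent(row, fresh);
            columns = (prev == null) ? fresh : prev;
        }
        columns.put(column, value);
    }

    public byte[] get(String row, String column) {
        ConcurrentNavigableMap<String, byte[]> columns = rows.get(row);
        return (columns == null) ? null : columns.get(column);
    }
}

A point read touches the outer map once and then the (usually much smaller) per-row map; the cost is maintaining many small inner maps, which may be part of the pain point mentioned above.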
> > > And for writes, the WAL is definitely the latency bottleneck. But if you are doing lots of small operations, so your WALEdits are not large, and with some of the HLog batching features going into trunk, you end up with hundreds of requests per HLog sync. And although the syncs are higher latency, with batching you end up getting high throughput. And the bottleneck shifts.
> > >
> > > Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch, 4ms per sync, so 62.5k req/sec (62.5k * 100 bytes/req = 6.25MB/sec, very reasonable). If you're mixing in reads as well (or if you're doing increments, which do a read and a write), then this adds to the CPU usage and contention without adding to HLog throughput.
> > >
> > > All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput. Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU bottlenecked.
> > >
> > > For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM. Doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :) Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they haven't all made their way to Apache.
> > >
> > > In the end, after our optimizations, the real-world workload Dhruba and I were working with was not all in-memory, so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.
> > >
> > > JG
> > >
> > > From: Akash Ashok [mailto:thehellma...@gmail.com]
> > > Sent: Sunday, October 23, 2011 2:57 AM
> > > To: dev@hbase.apache.org
> > > Subject: Re: SILT - nice keyvalue store paper
> > >
> > > I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the MemStore uses) and a synchronized TreeMap (which was literally a TreeMap, synchronized). I executed concurrent reads, writes and deletes on both of them. Surprisingly, the synchronized TreeMap performed better, though only slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
> > >
> > > Here is the output of a few runs.
> > >
> > > Sometimes the difference was considerable:
> > > Using HBaseMap it took 20438ms
> > > Using TreeMap it took 11613ms
> > > Time difference: 8825ms
> > >
> > > And sometimes the difference was negligible:
> > > Using HBaseMap it took 13370ms
> > > Using TreeMap it took 9482ms
> > > Time difference: 3888ms
> > >
> > > I've attached the test java file which I wrote for this. The difference may be minor, but it is still surprising considering that ConcurrentSkipListMap uses fancy two-level indexes which are said to improve deletion performance.
> > >
> > > And here are the details of the test run:
> > > 100 threads each fetching 1,000,000 records
> > > 100 threads each adding 1,000,000 records
> > > 100 threads each deleting 1,000,000 records
> > > (reads, writes and deletes running simultaneously)
> > >
> > > Cheers,
> > > Akash A
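The attached test file isn't reproduced in the archive. A rough sketch of this kind of comparison (not the attached code; the thread count, key space, Long keys and operation counts are arbitrary choices) might look like the following.

import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MapContentionBench {

    static final int THREADS = 100;           // threads per operation type
    static final int OPS_PER_THREAD = 100000; // the run described above used 1,000,000

    // Runs readers, writers and deleters concurrently against the given map
    // and returns the elapsed wall-clock time in milliseconds.
    static long run(final SortedMap<Long, Long> map) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS * 3);
        final CountDownLatch start = new CountDownLatch(1);
        for (int t = 0; t < THREADS; t++) {
            final long base = (long) t * OPS_PER_THREAD;
            pool.execute(new Runnable() { public void run() {      // writer
                awaitQuietly(start);
                for (long i = 0; i < OPS_PER_THREAD; i++) map.put(base + i, i);
            }});
            pool.execute(new Runnable() { public void run() {      // reader
                awaitQuietly(start);
                for (long i = 0; i < OPS_PER_THREAD; i++) map.get(base + i);
            }});
            pool.execute(new Runnable() { public void run() {      // deleter
                awaitQuietly(start);
                for (long i = 0; i < OPS_PER_THREAD; i++) map.remove(base + i);
            }});
        }
        long begin = System.nanoTime();
        start.countDown();                    // release all threads at once
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
    }

    static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        // Alternate the two maps over several rounds in the same JVM so that
        // HotSpot warm-up is not charged to whichever map happens to run first.
        for (int round = 0; round < 3; round++) {
            long cslm = run(new ConcurrentSkipListMap<Long, Long>());
            long sync = run(Collections.synchronizedSortedMap(new TreeMap<Long, Long>()));
            System.out.println("round " + round + ": CSLM " + cslm
                    + " ms, synchronized TreeMap " + sync + " ms");
        }
    }
}

The Thread.yield() suggestion above would go inside the per-operation loops, and OPS_PER_THREAD is a single constant so it can be scaled up by an order of magnitude as suggested.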
> > > On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
> > > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
> > > > I would think that the bottleneck for insert is the wal part?
> > > > It would be possible to do a kind of memory-list preparation during the wal insertion, and if the wal insertion is confirmed, do the insertion in the memory list. But it's strange for the in-memory insertion to matter compared with the insertion on disk...
> > >
> > > Yes, WAL is the long pole writing. But MemStore has issues too; Dhruba says cpu above. Reading and writing it is also 'slow'.
> > > St.Ack
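A minimal sketch of the ordering suggested above: do any preparation while the WAL append/sync is in flight, and publish the edit into the memstore map only after the sync is confirmed. The Wal interface, its appendAndSync method, and the KeyValue class here are stand-ins invented for the example, not HBase APIs.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

public class WalThenMemstoreSketch {

    // Stand-in for the real write-ahead log; appendAndSync is a made-up method.
    interface Wal {
        Future<Void> appendAndSync(List<KeyValue> edits);
    }

    // Stand-in key/value pair, not the HBase KeyValue class.
    static class KeyValue {
        final String key;
        final byte[] value;
        KeyValue(String key, byte[] value) { this.key = key; this.value = value; }
    }

    private final ConcurrentNavigableMap<String, byte[]> memstore =
            new ConcurrentSkipListMap<String, byte[]>();

    public void write(Wal wal, List<KeyValue> edits) throws Exception {
        // 1. Kick off the durable append; this is the long pole of the write.
        Future<Void> synced = wal.appendAndSync(edits);
        // 2. Preparation that does not publish data to readers could overlap with
        //    the sync here (sizing, encoding, building entries off to the side).
        // 3. Only once the sync has succeeded do the edits become visible.
        synced.get();
        for (KeyValue kv : edits) {
            memstore.put(kv.key, kv.value);
        }
    }

    public static void main(String[] args) throws Exception {
        // Dummy WAL that completes immediately, just to exercise the ordering.
        Wal wal = new Wal() {
            public Future<Void> appendAndSync(List<KeyValue> edits) {
                FutureTask<Void> done = new FutureTask<Void>(new Runnable() {
                    public void run() { /* pretend to append and sync */ }
                }, null);
                done.run();
                return done;
            }
        };
        WalThenMemstoreSketch store = new WalThenMemstoreSketch();
        store.write(wal, Arrays.asList(new KeyValue("row1", new byte[] { 1 })));
        System.out.println("memstore entries: " + store.memstore.size()); // prints 1
    }
}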