On Mon, Oct 24, 2011 at 4:50 AM, Jonathan Gray <jg...@fb.com> wrote:

> Very nice experiment, Akash. Keep getting your hands dirty and digging! :)
>
> I think your results might change if you bump the test up to 1000 threads or so. 100 threads can still perform okay when there's a global lock, but the contention at 1000 threads will kill you, and that's when CSLM should do much better. (1000 handler threads is approx. what I run with on RS in prod.) Though I am a bit surprised that at 100 threads the TreeMap was significantly faster. Your inconsistent results are a bit odd; you might try an order of magnitude more operations per thread. You might also gather some statistics about tree size and per-operation latency.
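For concreteness, here is a minimal sketch of the kind of contention micro-benchmark being discussed: a globally synchronized TreeMap and a ConcurrentSkipListMap hammered with the same mixed read/write/delete load. The thread and operation counts are placeholders, and this is not the test file attached later in the thread.

import java.util.Collections;
import java.util.Random;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CountDownLatch;

public class MapContentionBench {
    // Placeholder sizes; bump THREADS to 1000+ to see the effect of lock contention.
    static final int THREADS = 1000;
    static final int OPS_PER_THREAD = 10000;
    static final int KEY_SPACE = 1000000;

    public static void main(String[] args) throws InterruptedException {
        System.out.println("synchronized TreeMap:  "
            + run(Collections.synchronizedSortedMap(new TreeMap<Long, Long>())) + " ms");
        System.out.println("ConcurrentSkipListMap: "
            + run(new ConcurrentSkipListMap<Long, Long>()) + " ms");
    }

    static long run(final SortedMap<Long, Long> map) throws InterruptedException {
        final CountDownLatch start = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(THREADS);
        for (int t = 0; t < THREADS; t++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        start.await();
                        Random rnd = new Random();
                        for (int i = 0; i < OPS_PER_THREAD; i++) {
                            Long k = (long) rnd.nextInt(KEY_SPACE);
                            if (i % 3 == 0) {
                                map.put(k, k);      // writer
                            } else if (i % 3 == 1) {
                                map.get(k);         // reader
                            } else {
                                map.remove(k);      // deleter
                            }
                        }
                    } catch (InterruptedException ignored) {
                    } finally {
                        done.countDown();
                    }
                }
            }).start();
        }
        long begin = System.nanoTime();
        start.countDown();   // release all threads at once
        done.await();        // wait for them all to finish
        return (System.nanoTime() - begin) / 1000000L;
    }
}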
Hey JG. At 1000 threads the situation reversed; CSLM performed much better, as you said. I took it a little overboard just to see what happens at 10,000 threads: the globally synchronized TreeMap again performed better, consistently, which was super duper surprising. I am trying to figure out why this is happening and will post the results soon.

> I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest. I recall trying some impractically large MemStores and everything still being quite fast.
>
> Over in Cassandra, I believe they have a two-level CSLM, with the first map key being the row and then the columns for each row in their own CSLM. I've been told this is somewhat of a pain point for them. And keep in mind they have one shard/region per node, and we generally have several smaller MemStores on each node (tens to thousands). Not sure we would want to try that. There could be some interesting optimizations if you had very specific issues; for example, if you had a ton of reads to MemStore and not many writes, you could keep some kind of mirrored hashmap.
>
> And for writes, the WAL is definitely the latency bottleneck. But if you are doing lots of small operations, so your WALEdits are not large, and with some of the HLog batching features going into trunk, you end up with hundreds of requests per HLog sync. And although the syncs are higher latency, with batching you end up getting high throughput. And the bottleneck shifts.
>
> Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch at 4ms per sync, which gives 62.5k req/sec (62.5k req/sec * 100 bytes/req is roughly 6MB/sec, very reasonable). If you're mixing in reads as well (or if you're doing increments, which do a read and a write), then this adds to the CPU usage and contention without adding to HLog throughput.
>
> All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput. Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU-bottlenecked.
>
> For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM: doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :) Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they haven't all made their way to Apache.
>
> In the end, after our optimizations, the real-world workload Dhruba and I were working with was not all in-memory, so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.
>
> JG
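To illustrate the tailMap/iterator pattern JG describes above, here is a rough hypothetical sketch. The key type, range bounds and removal predicate are invented for illustration; this is not the actual patch.

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class TailMapSweep {
    public static void main(String[] args) {
        ConcurrentSkipListMap<Long, String> memstore = new ConcurrentSkipListMap<Long, String>();
        for (long k = 0; k < 1000; k++) {
            memstore.put(k, "v" + k);
        }

        // Position once with tailMap() and walk forward with the iterator,
        // instead of issuing an independent lookup from the root for every key.
        long startKey = 500L;
        Iterator<Map.Entry<Long, String>> it = memstore.tailMap(startKey).entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, String> e = it.next();
            if (e.getKey() > 600L) {
                break;                  // past the range we care about
            }
            if (e.getKey() % 2 == 0) {  // some per-entry decision (made up for illustration)
                it.remove();            // drop the entry just visited instead of a separate get() + remove()
            }
        }
        System.out.println("remaining entries: " + memstore.size());
    }
}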
> From: Akash Ashok [mailto:thehellma...@gmail.com]
> Sent: Sunday, October 23, 2011 2:57 AM
> To: dev@hbase.apache.org
> Subject: Re: SILT - nice keyvalue store paper
>
> I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the MemStore uses) and a synchronized TreeMap (literally a TreeMap wrapped in synchronization), executing concurrent reads, writes and deletes on both of them. Surprisingly, the synchronized TreeMap performed better, though only slightly better, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
>
> Here is the output of a few runs.
>
> Sometimes the difference was considerable:
> Using HBaseMap it took 20438ms
> Using TreeMap it took 11613ms
> Time Difference: 8825ms
>
> And sometimes the difference was negligible:
> Using HBaseMap it took 13370ms
> Using TreeMap it took 9482ms
> Time Difference: 3888ms
>
> I'm attaching the test java file which I wrote to test it. This might be a very minor difference, but it is still surprising considering the fact that ConcurrentSkipListMap uses fancy 2-level indexes, which they say improve deletion performance.
>
> And here are the details about the test run:
> 100 threads each fetching 1,000,000 records
> 100 threads each adding 1,000,000 records
> 100 threads each deleting 1,000,000 records
> (reads, writes and deletes simultaneously)
>
> Cheers,
> Akash A
>
> On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
> > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
> > > I would think that the bottleneck for insert is the WAL part? It would be possible to do a kind of memory list preparation during the WAL insertion and, if the WAL insertion is confirmed, do the insertion in the memory list. But it's strange to have the in-memory insertion matter as much as the insertion on disk...
> >
> > Yes, WAL is the long pole writing. But MemStore has issues too; Dhruba says CPU above. Reading and writing it is also 'slow'.
> > St.Ack
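As a final illustration, here is a hypothetical sketch of the "prepare the memory insertion during the WAL append, apply it once the WAL insertion is confirmed" idea floated above. None of this is actual HBase code; the class and method names are invented.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

public class PreparedMemstoreWrite {
    private final ConcurrentSkipListMap<String, String> memstore;
    private final List<String[]> staged = new ArrayList<String[]>();

    public PreparedMemstoreWrite(ConcurrentSkipListMap<String, String> memstore) {
        this.memstore = memstore;
    }

    // While the WAL append/sync is still in flight: stage the edit, don't publish it to readers.
    public void prepare(String key, String value) {
        staged.add(new String[] { key, value });
    }

    // Only after the WAL sync is acknowledged: publish the staged edits to the in-memory map.
    public void commit() {
        for (String[] kv : staged) {
            memstore.put(kv[0], kv[1]);
        }
        staged.clear();
    }

    // If the WAL write fails: drop the staged edits; the memstore never saw them.
    public void abort() {
        staged.clear();
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> memstore = new ConcurrentSkipListMap<String, String>();
        PreparedMemstoreWrite write = new PreparedMemstoreWrite(memstore);
        write.prepare("row1/cf:q1", "value1");   // the WAL append would happen around here
        write.commit();                          // ... and this only runs once the sync returns
        System.out.println(memstore);
    }
}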