There is nothing crazy about async I/O. Netty (which I presume is the underlying 
network library in the latest Hadoop) is fully async.
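Async I/O does not even require an external library; it has been in the JDK itself since NIO.2 (Java 7). A minimal standard-library sketch, submitting a read and collecting the result later via a Future (the file name and helper methods here are illustrative only):

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

// Minimal NIO.2 async read: the read() call returns immediately with a
// Future; the calling thread is free to do other work while the kernel
// completes the I/O.
public class AsyncReadDemo {

    static String readAsync(Path path) {
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(64);
            Future<Integer> pending = ch.read(buf, 0); // submits, does not block
            int n = pending.get();       // block only when the result is needed
            return new String(buf.array(), 0, n);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Round-trip demo: write a temp file, read it back asynchronously.
    static String demo() {
        try {
            Path p = Files.createTempFile("async-demo", ".txt");
            Files.write(p, "hello async".getBytes());
            String s = readAsync(p);
            Files.delete(p);
            return s;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "hello async"
    }
}
```

Netty builds a full event-loop framework on top of the same kernel facilities; the point is only that the primitive itself is ordinary.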
You can run 1,000, 10,000, and maybe more threads on Linux, but performance-wise 
this is a sub-optimal decision.
With a 20 ms timeslice on an 8-core CPU, some threads can wait for their schedule 
slot for up to 2.5 seconds (1,000 threads), 
25 seconds (10,000 threads), or 250 seconds (100,000 threads). This is the worst-case 
scenario, of course. If you do not care about latency it probably won't hurt you too much. 
Other problems with scheduling a large number of threads:

1. the time the kernel scheduler spends doing nothing (from the application's 
point of view).
2. L1/L2/L3 cache thrashing
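
The wait figures above follow from a simple back-of-envelope formula. A sketch, assuming a fixed timeslice and perfectly fair round-robin scheduling (real schedulers like CFS only approximate this):

```java
// Worst-case scheduling delay: with T runnable threads on C cores and a
// Q-ms timeslice, a thread may wait (T / C) * Q ms for its next slot
// under plain round-robin.
public class SchedulerMath {
    static long worstCaseWaitMs(long threads, long cores, long timesliceMs) {
        return (threads / cores) * timesliceMs;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseWaitMs(1_000, 8, 20));   // 2500 ms = 2.5 s
        System.out.println(worstCaseWaitMs(10_000, 8, 20));  // 25000 ms = 25 s
        System.out.println(worstCaseWaitMs(100_000, 8, 20)); // 250000 ms = 250 s
    }
}
```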

One more thing: keeping the total number of active threads below the number of 
physical cores (or hardware threads, on SPARC Tx) is the #1 requirement of a 
high-performance application. 
This allows you to cut thread-synchronization costs by using techniques like 
spin locks (busy loops). Stack has already
mentioned the LMAX Disruptor framework, which completely bypasses standard Java 
thread synchronization (they use CAS, busy loops, and thread yielding). 
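
To illustrate the CAS-plus-busy-loop style: a toy spinlock sketch (this is my own illustration, not the Disruptor's actual code, which is built around a ring buffer and sequence counters):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy CAS-based spinlock: acquire by compare-and-set on a flag, busy-loop
// on failure. Spinning only pays off when runnable threads <= cores,
// which is exactly the condition described above; otherwise a spinning
// thread burns its whole timeslice accomplishing nothing.
public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait(); // CPU pause hint (Java 9+); drop on older JDKs
        }
    }

    public boolean tryLock() {
        return locked.compareAndSet(false, true);
    }

    public void unlock() {
        locked.set(false);
    }
}
```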

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodio...@carrieriq.com

________________________________________
From: lars hofhansl [lhofha...@yahoo.com]
Sent: Monday, October 24, 2011 12:04 PM
To: dev@hbase.apache.org
Subject: Re: SILT - nice keyvalue store paper

I am not sure I would generally agree with this statement.
The Linux kernel can easily handle tens of thousands of threads (you just need 
to keep the default stack size small).
(There have been tests with 1M threads, too.)

Doing async IO is all the rage today (see node.js and friends), but that also 
is not necessarily done with a full understanding of the
performance characteristics.

Just my $0.02.


----- Original Message -----
From: Vladimir Rodionov <vrodio...@carrieriq.com>
To: "dev@hbase.apache.org" <dev@hbase.apache.org>
Cc:
Sent: Monday, October 24, 2011 11:43 AM
Subject: RE: SILT - nice keyvalue store paper

Jonathan, 1000 threads is bad application design. They will kill you even w/o 
contention. In my opinion, overuse of *modern* Java concurrency framework
machinery w/o a real understanding of its benefits is a major contributor to the poor 
performance and poor scalability of Java applications. Before making any decision
to use a fancy data structure that can potentially affect application 
performance, it is always a good idea to run some benchmarks.


Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodio...@carrieriq.com

________________________________________
From: Jonathan Gray [jg...@fb.com]
Sent: Sunday, October 23, 2011 4:20 PM
To: dev@hbase.apache.org
Subject: RE: SILT - nice keyvalue store paper

Very nice experiment, Akash.  Keep getting your hands dirty and digging!  :)

I think your results might change if you bump the test up to 1000 threads or 
so.  100 threads can still perform okay when there's a global lock, but the 
contention at 1000 threads will kill you, and that's when CSLM should do much 
better.  (1000 handler threads is approx. what I run with on RS in prod.)  
Though I am a bit surprised that at 100 threads the TreeMap was significantly 
faster.  Your inconsistent results are a bit odd; you might try an order of 
magnitude more operations per thread.  You might also gather some statistics 
about tree size and per-operation latency.

I've done some isolated CSLM benchmarks in the past and have never been able to 
reproduce any of the slowness people suggest.  I recall trying some 
impractically large MemStores and everything still being quite fast.

Over in Cassandra, I believe they have a two-level CSLM with the first map key 
being the row and then the columns for each row in their own CSLM.  I've been 
told this is somewhat of a pain point for them.  And keep in mind they have one 
shard/region per node and we generally have several smaller MemStores on each 
node (tens to thousands).  Not sure we would want to try that.  There could be 
some interesting optimizations if you had very specific issues, like if you had 
a ton of reads to MemStore and not many writes you could keep some kind of 
mirrored hashmap.

And for writes, the WAL is definitely the latency bottleneck.  But if you are 
doing lots of small operations, so your WALEdits are not large, and with some 
of the HLog batching features going into trunk, you end up with hundreds of 
requests per HLog sync.  And although the syncs are higher latency, with 
batching you end up getting high throughput.  And the bottleneck shifts.

Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync 
batch, 4ms per sync, so 62.5k req/sec.  (62.5k * 100 bytes/req = 6.25MB/sec, very 
reasonable).  If you're mixing in reads as well (or if you're doing increments 
which do a read and write), then this adds to the CPU usage and contention 
without adding to HLog throughput.
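
The batching arithmetic above can be written down directly. A sketch using the figures given (250 requests per sync batch, 4 ms per sync, 100 bytes per request, all of which are illustrative assumptions from the paragraph, not measured constants):

```java
// Back-of-envelope WAL throughput under group commit: one sync every
// syncMs milliseconds, each carrying batchSize requests.
public class WalThroughput {
    // syncs/sec = 1000 / syncMs; requests/sec = batchSize * syncs/sec
    static double reqPerSec(int batchSize, double syncMs) {
        return batchSize * (1000.0 / syncMs);
    }

    static double bytesPerSec(double reqPerSec, int bytesPerReq) {
        return reqPerSec * bytesPerReq;
    }

    public static void main(String[] args) {
        double rps = reqPerSec(250, 4.0);
        System.out.println(rps);                   // 62500.0 req/sec
        System.out.println(bytesPerSec(rps, 100)); // 6250000.0 bytes/sec, ~6.25 MB/sec
    }
}
```

Note that sync latency drops out of per-request cost once batching kicks in, which is why the bottleneck migrates to CPU and contention.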

All of a sudden the bottleneck becomes CPU/contention and not HLog latency or 
throughput.  Highly concurrent increments/counters with a largely in-memory 
dataset can easily be CPU bottlenecked.

For one specific application Dhruba and I worked on, we made some good 
improvements in CPU efficiency by reducing the number of operations and 
increasing efficiency on the CSLM.  Doing things like always taking a tailMap 
and working from that instead of starting at the root node, using an iterator() 
and taking advantage of the available remove() semantics, or simply just 
mutating things that are normally immutable :)  Unfortunately many of these 
optimizations were semi-horrid hacks and introduced things like 
ModifiableKeyValues, so not all of them have made their way to Apache.
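
To sketch the tailMap idea in isolation (the method and key types here are my own illustration, not the actual patch): instead of re-descending from the root for each successive key, take a tailMap view once and walk it, using the iterator's remove() where a delete is needed:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch: scan a CSLM from a start key via tailMap() instead of repeated
// lookups from the root, and delete via the iterator's remove(), which
// works from the iterator's current position.
public class TailMapScan {
    static int scanAndPrune(ConcurrentSkipListMap<String, Long> map,
                            String fromKey, long cutoff) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it =
            map.tailMap(fromKey).entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() < cutoff) {
                it.remove(); // no second search for the key being deleted
                removed++;
            }
        }
        return removed;
    }
}
```

CSLM iterators are weakly consistent, so this tolerates concurrent writers without throwing, at the cost of possibly missing entries added mid-scan.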

In the end, after our optimizations, the real world workload Dhruba and I were 
working with was not all in-memory so the bottleneck in production became the 
random reads (so increasing the block cache hit ratio is the focus) rather than 
CPU contention or HLog throughput.

JG

From: Akash Ashok [mailto:thehellma...@gmail.com]
Sent: Sunday, October 23, 2011 2:57 AM
To: dev@hbase.apache.org
Subject: Re: SILT - nice keyvalue store paper

I was running some similar tests and came across a surprising finding. I 
compared reads and writes on ConcurrentSkipListMap (which the MemStore uses) 
and a synchronized TreeMap (literally a TreeMap wrapped in synchronization). I executed 
concurrent reads, writes, and deletes on both of them.
Surprisingly, the synchronized TreeMap performed better, though just slightly 
better, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.

Here are the output of a few runs

Sometimes the difference was considerable
Using HBaseMap it took 20438ms
Using TreeMap it took 11613ms
Time Difference:8825ms

And sometimes the difference was negligible
Using HBaseMap it took 13370ms
Using TreeMap it took 9482ms
Time Difference:3888ms

I'm attaching the test Java file which I wrote to test it.
This might be a very minor difference, but it is still surprising considering the fact 
that ConcurrentSkipListMap uses fancy two-level indexes which they say improve 
deletion performance.

And here are the details about the test run.
100 threads each fetching 1,000,000 records,
100 threads each adding 1,000,000 records,
100 threads each deleting 1,000,000 records
(reads, writes, and deletes simultaneously).
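
A minimal skeleton of this kind of comparison (this is my own sketch, not the attached test file; sizes are deliberately tiny, where the run described used 100 threads and 1,000,000 ops each with reads and deletes mixed in):

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Time the same concurrent insert workload against a lock-free
// ConcurrentSkipListMap and a globally-locked TreeMap.
public class MapBench {
    static long timeConcurrentInserts(SortedMap<Integer, Integer> map,
                                      int threads, int opsPerThread) {
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            final int base = t * opsPerThread; // disjoint key range per thread
            workers[t] = new Thread(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    map.put(base + i, i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        SortedMap<Integer, Integer> cslm = new ConcurrentSkipListMap<>();
        SortedMap<Integer, Integer> syncTree =
            Collections.synchronizedSortedMap(new TreeMap<>());
        System.out.println("CSLM:         "
            + timeConcurrentInserts(cslm, 8, 100_000) + " ns");
        System.out.println("sync TreeMap: "
            + timeConcurrentInserts(syncTree, 8, 100_000) + " ns");
    }
}
```

As Jonathan notes upthread, results like these are sensitive to thread count and per-thread op count; at low contention a global lock can look fine.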

Cheers,
Akash A
On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
> I would think that the bottleneck for insert is the WAL part?
> It would be possible to do a kind of memory-list preparation during the WAL
> insertion and, if the WAL insertion is confirmed, do the insertion in the
> memory list. But it's strange for the insertion in memory to matter vs.
> the insertion on disk...
>
Yes, WAL is the long pole writing.  But MemStore has issues too;
Dhruba says cpu above.  Reading and writing it is also 'slow'.
St.Ack
