Thanks, that would be great.

Actually the code is Perl; I'm using Hadoop Streaming to do the map-reduce 
(bioinformatics data that we have lots of Perl libraries for). So far on a 
single thread it works quite well (in house we get ~300 rows/sec, on EC2 maybe 
half that with indexes), usually with the Perl being the bottleneck and HBase 
just soaking it up. We were hoping to throw more CPUs at it to increase the 
load speed.
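
For what it's worth, each streaming mapper's insert path boils down to roughly
the sketch below (simplified; the 'reads' table, the 'd:seq' column, and the
tab-delimited input format are just placeholders here, not our real schema).
It uses the stock Perl bindings the Thrift compiler generates from
Hbase.thrift, pointed at the gateway running on the local node:

#!/usr/bin/env perl
# Sketch of one streaming mapper's insert path (simplified; table and
# column names are illustrative only, not our actual schema).
use strict;
use warnings;

use lib 'gen-perl';               # generated Thrift bindings on the path
use Thrift::Socket;
use Thrift::BufferedTransport;
use Thrift::BinaryProtocol;
use Hbase::Hbase;                 # generated from Hbase.thrift

# Connect to the Thrift gateway running on the local node.
my $socket    = Thrift::Socket->new('localhost', 9090);
my $transport = Thrift::BufferedTransport->new($socket);
my $protocol  = Thrift::BinaryProtocol->new($transport);
my $client    = Hbase::HbaseClient->new($protocol);
$transport->open();

# Each input line from streaming is "rowkey<TAB>sequence".
while (my $line = <STDIN>) {
    chomp $line;
    my ($rowkey, $seq) = split /\t/, $line, 2;
    my $mutation = Hbase::Mutation->new({
        column => 'd:seq',        # family:qualifier, placeholder
        value  => $seq,
    });
    $client->mutateRow('reads', $rowkey, [$mutation]);
    print "$rowkey\t1\n";         # emit something so streaming sees output
}

$transport->close();

Right now that's one mutateRow per input line; if round trips ever turn out to
matter we could batch those up with mutateRows, but so far the Perl on either
side of the insert is the slow part.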

-chris

On Apr 30, 2010, at 5:01 PM, Jean-Daniel Cryans wrote:

> Not sure why you are going through Thrift if you are already using
> Java (you want to test Thrift's speed because Java isn't your main dev
> language?), but it will maybe add 1 ms or 2, really not that bad. Here
> at StumbleUpon we use Thrift to get our PHP website to talk to HBase,
> and on average we stay under 10 ms for random gets. Our machines are
> 2x i7, 24 GB RAM, 4x 1 TB SATA.
> 
> My coworker (Stack) pinged the author of the contrib to see if he can
> make a patch for your issue.
> 
> J-D
> 
> On Fri, Apr 30, 2010 at 4:51 PM, Chris Tarnas <c...@email.com> wrote:
>> 
>> On Apr 30, 2010, at 4:44 PM, Jean-Daniel Cryans wrote:
>> 
>>> On Fri, Apr 30, 2010 at 4:32 PM, Chris Tarnas <c...@email.com> wrote:
>>>> 
>>>> 
>>>> I'm also using Thrift to connect and am wondering if that itself puts an 
>>>> overall limit on scaling? It does seem that no matter how many more 
>>>> mappers and servers I add, even without indexing, I am capped at about 5k 
>>>> rows/sec total. I'm waiting a bit as the table grows so that it is split 
>>>> across more regionservers; hopefully that will help, but as far as I can 
>>>> tell I am not hitting any CPU or I/O constraint during my tests.
>>> 
>>> I don't understand the "I'm also using thrift" and "how many more
>>> mappers" part; are you using Thrift inside a map? Anyways, more
>>> clients won't help since there's a single mega serialization of all
>>> the inserts to the index table per region server. It's normal not to
>>> see any CPU/mem/IO contention since, in this case, it's all about the
>>> speed at which you can process a single row insertion. The rest of the
>>> threads just wait...
>>> 
>> 
>> Sorry - should have been clearer. I'm testing now with normal tables 
>> and regionservers, and I seem to cap out at about 5-7k rows a second for 
>> inserts. My method for doing inserts is to use map-reduce on Hadoop to 
>> launch many insert processes; each process uses the local Thrift server on 
>> each node to connect to HBase. In this case I hope that other threads can 
>> insert at the same time.
>> 
>> -chris
>> 