Hi All,

I've used thrift from PHP, and have done bulk imports (multiple rows in one 
write). Here is a sketch; it assumes the Thrift PHP library and the generated 
HBase bindings are already loaded:

<?php

// Connect to a thrift server (host/port are whatever your setup uses).
$socket    = new TSocket('localhost', 9090);
$transport = new TBufferedTransport($socket);
$protocol  = new TBinaryProtocol($transport);
$client    = new HbaseClient($protocol);
$transport->open();

// Three column mutations, reused for each row below.
$mutations = array(
  new Mutation(array('column' => 'cf:q1', 'value' => 'v1')),
  new Mutation(array('column' => 'cf:q2', 'value' => 'v2')),
  new Mutation(array('column' => 'cf:q3', 'value' => 'v3'))
);

// One BatchMutation per row, all sent in a single call.
$batch = array(
  new BatchMutation(array('row' => 'r1', 'mutations' => $mutations)),
  new BatchMutation(array('row' => 'r2', 'mutations' => $mutations)),
  new BatchMutation(array('row' => 'r3', 'mutations' => $mutations))
);

$client->mutateRows('Table', $batch);
$transport->close();

?>

This will insert three rows with three columns each over (what I assume is) a 
single connection, in one 'batch' upload. My testing with 130k large rows has 
given me no reason to believe that assumption is wrong.

To answer the bottleneck question, I believe thrift would eventually become a 
bottleneck under a heavy enough load. You could use a dedicated thrift server 
to help. A rudimentary solution would be to launch thrift on each region 
server and do simple DNS round robin, but that won't work as well as Ryan's 
suggestion. With either of those, you're still not guaranteed to connect to 
the thrift server that hosts the requested data locally; I would imagine that 
thrift would require significant work in order to provide that functionality.
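For the spread-the-load idea, the Thrift PHP library ships a TSocketPool 
transport that picks a host from a list for you, which saves you the DNS 
trickery. A minimal sketch (the regionserver host names below are made up, 
and it assumes the same generated HBase bindings as above):

```php
// Hypothetical regionserver hosts, each running a thrift gateway on 9090.
$pool = new TSocketPool(
  array('rs1.example.com', 'rs2.example.com', 'rs3.example.com'),
  array(9090)  // a single port here is applied to every host
);
$pool->setNumRetries(2);  // try up to 2 hosts before giving up

$transport = new TBufferedTransport($pool);
$protocol  = new TBinaryProtocol($transport);
$client    = new HbaseClient($protocol);
$transport->open();  // connects to one of the pooled hosts
```

Note this only balances load; like plain round robin, it still can't route a 
request to the thrift server that holds the data locally.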

Another possibility (the one I used) is to run thrift on a machine outside of 
the cluster (ideally on any/all machines making thrift requests); the thrift 
requests can then always point to 'localhost', and you access the cluster 
almost exactly as if you had coded against the native java hbase client. I 
just copied my entire hbase software and config onto my workstation, launched 
thrift, and configured my php script to connect to localhost instead of the 
original cluster-housed thrift server.
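Launching the gateway from a local HBase install looks something like this 
(paths are relative to the HBase directory; exact flags may differ a bit 
between versions):

```shell
# Start the thrift gateway in the foreground:
bin/hbase thrift start

# ...or run it as a background daemon:
bin/hbase-daemon.sh start thrift
```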

In a web environment, you could just run thrift on all of your web servers, and 
the web requests would be proxied through the local thrift instance into hbase, 
so the thrift load/capacity would scale exactly with your web servers.

Just a thought. Hope this helps,

Travis Hegner
http://www.travishegner.com/

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel 
Cryans
Sent: Thursday, September 03, 2009 7:24 AM
To: [email protected]; [email protected]
Subject: Re: Is the thrift server a likely bottleneck?

Silvain,

A BatchMutation is for a single row and multiple columns (for that
row) so in the HBase Thrift API you cannot batch insert many rows. In
the Java API the equivalent to BatchMutation is Put (which before was
named BatchInsert but people got confused, just like now).

J-D

On Thu, Sep 3, 2009 at 4:25 AM, Sylvain Hellegouarch<[email protected]> wrote:
>
>> Thrift spawns as many threads as requests, so running more than one
>> shouldn't benefit you much I think?
>
> Being a little unaware of Java's cleverness with threads I cannot really
> say but you're probably right.
>
>>
>> I run 1 thriftserver per regionserver, co existing, and then use
>> TSocketPool on the client side to spread load around.
>>
>> But generally, YES, the thrift server could be a bottleneck.  The main
>> problem with thrift and performance is you cannot control the scanner
>> caching directly, and you cannot do bulk commits.  Both of those would
>> require some API changes, and while not impossible, just hasn't been
>> prioritized.
>
> I'm a little confused then as what is the difference between the bulk
> commit you mention and batch mutations support in the thrift interface.
>
> Moreover, the Hbase 0.20 API is a bit unclear as to when the commit is
> done when using Put. In fact I'm a little unclear as to what is the best
> practice to write lots of rows so that it is as efficient as it can. One
> by one? Batch Mutations?
>
>>
>> Personally, we use thrift for php scripts, and use the Java API for
>> map-reduces and bulk data operations. Thus achieving the best of both
>> worlds: cross language access from PHP and the faster Java-based API
>> for certain scenarios.
>
> We will be using Pig Latin probably for the M/R with a Java adapter to
> fetch rows from HBase. However we do use Python for writing and I'm
> willing to use Jython but that would probably create other dependencies
> issue that I'd be happy to avoid if Thrift is good enough :)
>
> Thanks,
> - Sylvain
>
>
> --
> Sylvain Hellegouarch
> http://www.defuze.org
>

