This might be a simplified view, but this is how I understand it.


HBase stores the data in a distributed way, by using various RegionServers.

MapReduce distributes computations, by using TaskTrackers.

So when a MapReduce job is run, it tries to run the Map/Reduce operations on TaskTrackers co-located with the RegionServers serving the data, thus co-locating the computation with the data.


So, if you use Map/Reduce, you get computation and data distribution by default, as well as a best effort to co-locate computation with data, maximizing efficiency as much as possible.
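That "best effort" placement (node-local first, then rack-local, as described further down the thread) can be sketched as a toy model. This is not the actual JobTracker scheduler, just an illustration; the host and rack naming scheme here is invented for the example.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of how a map task for an HBase region gets placed:
// prefer a TaskTracker on the same node as the RegionServer,
// then one on the same rack, then any tracker at all.
public class LocalityScheduler {

    // Rough rack lookup, assuming the rack is encoded in the hostname,
    // e.g. "rack1-node3" lives on "rack1". Real Hadoop uses a pluggable
    // topology script for this.
    static String rackOf(String host) {
        return host.split("-")[0];
    }

    static String pickTracker(String regionServerHost, List<String> trackers) {
        // 1. Node-local: a tracker on the RegionServer's own machine.
        for (String t : trackers) {
            if (t.equals(regionServerHost)) return t;
        }
        // 2. Rack-local: a tracker on the same rack.
        for (String t : trackers) {
            if (rackOf(t).equals(rackOf(regionServerHost))) return t;
        }
        // 3. Fall back to any tracker (data crosses racks).
        return trackers.get(0);
    }

    public static void main(String[] args) {
        List<String> trackers =
            Arrays.asList("rack1-node1", "rack1-node2", "rack2-node1");
        System.out.println(pickTracker("rack1-node2", trackers)); // node-local
        System.out.println(pickTracker("rack1-node3", trackers)); // rack-local
        System.out.println(pickTracker("rack3-node1", trackers)); // remote fallback
    }
}
```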

Now, you don't have to use Map/Reduce, but then you will have to take the extra effort to distribute your computations yourself, and try to co-locate them with the data.

That is in fact something I'm planning on doing, since I'm not sure yet whether my computations are suited to Map/Reduce. So I will probably run my own Java process co-located with the HBase RegionServers, and make sure that when my code asks for data, it gets the local data as much as possible.
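A rough sketch of that "local data first" idea: given a region-to-server mapping, a co-located process would work on the regions its own host serves before touching anything remote. The region names and hosts below are invented for illustration; in real code the mapping would come from the HBase client's table metadata rather than a hand-built map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch for a process co-located with a RegionServer: pick out the
// regions served by this host so they can be read without a network hop.
public class LocalFirstReader {

    static List<String> localRegions(Map<String, String> regionToServer,
                                     String localHost) {
        List<String> local = new ArrayList<String>();
        for (Map.Entry<String, String> e : regionToServer.entrySet()) {
            if (e.getValue().equals(localHost)) {
                // This region is served on our own machine: process it here.
                local.add(e.getKey());
            }
        }
        return local;
    }

    public static void main(String[] args) {
        Map<String, String> regions = new LinkedHashMap<String, String>();
        regions.put("region-a", "host1");
        regions.put("region-b", "host2");
        regions.put("region-c", "host1");
        System.out.println(localRegions(regions, "host1")); // [region-a, region-c]
    }
}
```

Regions not in the local list would still have to be fetched over the network, which is exactly the overhead the Map/Reduce path avoids automatically.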




On 7/21/09 11:48 PM, bharath vissapragada wrote:
That means it is not very useful to write Java code (using the API), because
it is not using the real power of Hadoop (distributed processing); instead it
has the overhead of fetching data from other machines, right?

On Wed, Jul 22, 2009 at 12:12 PM, Amandeep Khurana<[email protected]>  wrote:

Yes, only if you use MR. If you are writing your own code, it'll pull the
records to the place where you run the code.

On Tue, Jul 21, 2009 at 11:39 PM, Fernando Padilla <[email protected]> wrote:
That is if you use Hadoop MapReduce, right? Not if you simply access HBase
through the standard API (like Java)?



On 7/21/09 9:49 PM, Amandeep Khurana wrote:

Bharath,

The processing is done as local to the RS as possible. The first attempt is
at doing it locally on the same node. If that's not possible, it's done on
the same rack.

-ak


On Tue, Jul 21, 2009 at 9:43 PM, bharath vissapragada <[email protected]> wrote:

Hi all,
I have one simple doubt about HBase.

Suppose I use a scanner to iterate through all the rows in HBase and process
the data in the table corresponding to those rows. Is the processing of that
data done locally on the RegionServer where that particular region is
located, or is the data transferred over the network so that all the
processing is done on the single machine where the script runs?

thanks


