Hi Jonathan,

Thanks for your reply. That made things a lot clearer to me. But I have more questions :)

-- What is the best way to build an index over a field in HBase? Do I have to build it in a custom way and store it on HDFS? If I have a query (not in HQL), like a selection over two fields of which one is the row_id and the other is some other column, I can easily figure out which regions the row_id range belongs to, and if I have an index on the other column too, that can give me another set of candidates which I can intersect to get the final result.

-- Is there a way I can iterate through all the rows of a particular region only (for a particular relation)? TableMap will do it for all of them, if I am not wrong.
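To make the intersection idea concrete, here is a toy sketch in plain Python (not the HBase API): the secondary index is just a second lookup structure mapping each value of the indexed column to the row ids that contain it, which in HBase would itself be a table you keep in sync on every put. All row ids, column names, and values below are made up for illustration.

```python
# Toy main table: row_id -> columns.
main_table = {
    "row1": {"col_b": "x"},
    "row2": {"col_b": "y"},
    "row3": {"col_b": "x"},
    "row4": {"col_b": "x"},
}

# Hand-maintained secondary index: value of col_b -> set of row ids.
index_b = {}
for row_id, cols in main_table.items():
    index_b.setdefault(cols["col_b"], set()).add(row_id)

def query(r_lo, r_hi, b_value):
    """Answer: r_lo <= row_id < r_hi AND col_b == b_value."""
    # Candidate set 1: rows in the primary-key range (a region/row scan in HBase).
    in_range = {r for r in main_table if r_lo <= r < r_hi}
    # Candidate set 2: rows matching col_b, straight from the index.
    matching = index_b.get(b_value, set())
    # Intersect the two candidate sets to get the final answer.
    return sorted(in_range & matching)

print(query("row1", "row4", "x"))  # ['row1', 'row3']
```

The same shape works when each candidate set is a set of regions rather than rows: intersect first, then scan only the surviving regions.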
Well, I would need joins if, say, individual relations are managed independently and people would like to see data from all the relations in one query. It's like sharing of data among a topic-specific community. So what would be the best way to implement a sort-merge join? I hope that would be the easiest to start with.

Thanks for the response, Jonathan.

On Sun, Nov 16, 2008 at 3:34 PM, Jonathan Gray <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I am new to Hadoop and HBase. I am trying to understand how to use map
> > reduce with HBase as source and sink, and had the following questions.
> > I would appreciate it if someone could answer them and maybe point me
> > to some sample code:
> >
> > -- As far as I understand, tables get stored in different regions in
> > HBase, which are split across various nodes in HDFS. Is there a way to
> > control the amount of replication of a particular table?
>
> The regions are split across the different region servers; the contents of
> each region are made up of many different files/blocks which are then
> replicated across the nodes of HDFS. Replication is set in HDFS; HBase has
> no concept of replication. Therefore it's not possible (as far as I know)
> to set per-table replication levels. If it were possible to set
> per-directory replication settings in HDFS, then this might work, but I'm
> unsure whether that is possible; I think it is a global setting.
>
> > -- When we use a table scanner, it automatically switches between
> > various regions of a table, which may be present across different
> > nodes, and returns us the row handle. So it is a single process doing
> > that. Am I correct?
>
> The META table (which is stored in regions/on region servers like any
> other table) contains the start/end keys and node locations of all other
> tables and their regions.
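The lookup that META makes possible can be sketched in a few lines of Python: each region is identified by its start key, a region covers the key range up to the next region's start key, and finding the region for a row key is a binary search over the sorted start keys. The keys and server names below are invented for illustration.

```python
import bisect

# (start_key, server) for each region of one table, sorted by start key.
# A region covers [its start key, the next region's start key).
regions = [
    ("",  "node-a"),   # first region: everything up to "g"
    ("g", "node-b"),
    ("p", "node-c"),   # last region: "p" to the end of the table
]

start_keys = [start for start, _server in regions]

def region_for(row_key):
    # Rightmost region whose start key is <= row_key.
    i = bisect.bisect_right(start_keys, row_key) - 1
    return regions[i]

print(region_for("apple"))  # ('', 'node-a')
print(region_for("melon"))  # ('g', 'node-b')
print(region_for("zebra"))  # ('p', 'node-c')
```

When a scanner exhausts one region, the same lookup with that region's end key yields the next region to continue in, possibly on a different node, as described below.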
> When using a scanner, it will start with the region which includes your
> startrow (the first region of the table if no startrow is given), and once
> you have reached the end of the current region, it will use META
> information to find the next region. Your scanner will then continue in
> that region, which might be on a different node.
>
> > -- When we use TableMap to run MapReduce jobs on HBase, it
> > automatically creates several map tasks, i.e. one per region, and
> > performs the map operation on the key range of that particular region.
> > So if I use a table scanner inside a map task, will I still be
> > iterating through only the row range of that particular region, or
> > again the whole table?
>
> If you're using HTable.getScanner within an MR job, it will have the same
> behavior as anywhere else. You will be iterating through the whole table.
>
> > -- What is the best way to iterate through all the rows of a
> > particular region in a map task? This may be required to perform a
> > select operation in parallel.
>
> That is exactly what you are doing by using TableMap as the input to the
> MR job. Each map task is a scanner through a single region. You do not
> need to create a scanner within map(). There will be a call to map() for
> each row in that region, and a task for each region in the table.
>
> > Sorry for the long email. Many of the questions may be basic. I would
> > appreciate it if someone could answer them. Also, any suggestions for
> > implementing joins using MapReduce on HBase?
> >
> > Thanks
>
> Can you be more specific? HBase is not typically meant for joining data,
> though there are certainly plenty of valid cases for doing so. You may be
> able to get around it with better structuring of your data
> (denormalization is your friend); otherwise it's certainly possible to do
> with MR, depending on the specifics.
>
> Hope that helps.
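As a starting point for the sort-merge join asked about at the top, here is the merge step itself in plain Python, independent of HBase: both inputs are sorted by the join key (in an MR job, the shuffle/sort phase would effectively do this for you) and then merged in a single pass. The relation and column names are made up; in practice each input would come from a table scan.

```python
def sort_merge_join(left, right, key):
    """Inner join of two lists of dicts on the given key, sort-merge style."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the runs of matching keys.
            j0 = j
            while j < len(right) and right[j][key] == lk:
                out.append({**left[i], **right[j]})
                j += 1
            i += 1
            j = j0  # rewind in case the next left row has the same key
    return out

users = [{"uid": 2, "name": "bo"}, {"uid": 1, "name": "al"}]
posts = [{"uid": 1, "post": "hi"}, {"uid": 1, "post": "yo"}, {"uid": 3, "post": "zz"}]
print(sort_merge_join(users, posts, "uid"))
```

Since HBase already keeps rows sorted by row key, a join on row key could skip the sort entirely and just merge two scanners; a join on any other column would first need a sort (or a secondary index) over that column.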
> Don't hesitate to ask more questions; that's what the list is for. But
> don't forget to read the HBase Architecture docs and the other wiki pages,
> and to search the mailing list archives as well.
>
> Jonathan Gray

--
Nishant Khurana
Candidate for Masters in Engineering (Dec 2009)
Computer and Information Science
School of Engineering and Applied Science
University of Pennsylvania
