Hi Jonathan,

Thanks for your reply. That made things a lot clearer to me. But I have more questions :)

-- What is the best way to build an index over a field in HBase? Do I have to build it in a custom way and store it on HDFS? If I have a query (not in HQL), like a selection over two fields of which one is the row_id and the other is some other column, I can easily figure out which regions the row_id range belongs to, and if I have an index on the other column too, that can give me another set of candidates which I can intersect to get the final result.

-- Is there a way I can iterate through all the rows of a particular region only (for a particular relation)? TableMap will do it for all of them, if I am not wrong.
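To make the intersection idea concrete, here is a toy sketch in plain Python (not the HBase API): the secondary index is just a second lookup structure mapping each value of the indexed column to the row ids that contain it, which in HBase would itself be a table you keep in sync on every put. All row ids, column names, and values below are made up for illustration.

```python
# Toy main table: row_id -> columns.
main_table = {
    "row1": {"col_b": "x"},
    "row2": {"col_b": "y"},
    "row3": {"col_b": "x"},
    "row4": {"col_b": "x"},
}

# Hand-maintained secondary index: value of col_b -> set of row ids.
index_b = {}
for row_id, cols in main_table.items():
    index_b.setdefault(cols["col_b"], set()).add(row_id)

def query(r_lo, r_hi, b_value):
    """Answer: r_lo <= row_id < r_hi AND col_b == b_value."""
    # Candidate set 1: rows in the primary-key range (a region/row scan in HBase).
    in_range = {r for r in main_table if r_lo <= r < r_hi}
    # Candidate set 2: rows matching col_b, straight from the index.
    matching = index_b.get(b_value, set())
    # Intersect the two candidate sets to get the final answer.
    return sorted(in_range & matching)

print(query("row1", "row4", "x"))  # ['row1', 'row3']
```

The same shape works when each candidate set is a set of regions rather than rows: intersect first, then scan only the surviving regions.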
Well, I would need joins if, say, individual relations are managed independently and people would like to see data from all the relations in one query. It's like sharing of data among a topic-specific community. So what would be the best way to implement a sort-merge join? I hope that would be the easiest to start with.

Thanks for the response, Jonathan.

On Sun, Nov 16, 2008 at 3:34 PM, Jonathan Gray <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I am new to Hadoop and HBase. I am trying to understand how to use map
> > reduce with HBase as source and sink, and had the following questions.
> > I would appreciate it if someone could answer them and maybe point me
> > to some sample code:
> >
> > -- As far as I understand, tables get stored in different regions in
> > HBase, which are split across various nodes in HDFS. Is there a way to
> > control the amount of replication of a particular table?
>
> The regions are split across the different region servers; the contents of
> each region are made up of many different files/blocks which are then
> replicated across the nodes of HDFS. Replication is set in HDFS; HBase has
> no concept of replication. Therefore it's not possible (as far as I know)
> to set per-table replication levels. If it were possible to set
> per-directory replication settings in HDFS, then this might work, but I'm
> unsure whether that is possible; I think it is a global setting.
>
> > -- When we use a table scanner, it automatically switches between
> > various regions of a table, which may be present across different
> > nodes, and returns us the row handle. So it is a single process doing
> > that. Am I correct?
>
> The META table (which is stored in regions/on region servers like any
> other table) contains the start/end keys and node locations of all other
> tables and their regions.
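The lookup that META makes possible can be sketched in a few lines of Python: each region is identified by its start key, a region covers the key range up to the next region's start key, and finding the region for a row key is a binary search over the sorted start keys. The keys and server names below are invented for illustration.

```python
import bisect

# (start_key, server) for each region of one table, sorted by start key.
# A region covers [its start key, the next region's start key).
regions = [
    ("",  "node-a"),   # first region: everything up to "g"
    ("g", "node-b"),
    ("p", "node-c"),   # last region: "p" to the end of the table
]

start_keys = [start for start, _server in regions]

def region_for(row_key):
    # Rightmost region whose start key is <= row_key.
    i = bisect.bisect_right(start_keys, row_key) - 1
    return regions[i]

print(region_for("apple"))  # ('', 'node-a')
print(region_for("melon"))  # ('g', 'node-b')
print(region_for("zebra"))  # ('p', 'node-c')
```

When a scanner exhausts one region, the same lookup with that region's end key yields the next region to continue in, possibly on a different node, as described below.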
> When using a scanner, it will start with the region which includes your
> startrow (the first region of the table if no startrow is given), and once
> you have reached the end of the current region, it will use META
> information to find the next region. Your scanner will then continue in
> that region, which might be on a different node.
>
> > -- When we use TableMap to run MapReduce jobs on HBase, it
> > automatically creates several map tasks, i.e. one per region, and
> > performs the map operation on the key range of that particular region.
> > So if I use a table scanner inside a map task, will I still be
> > iterating through only the row range of that particular region, or
> > again the whole table?
>
> If you're using HTable.getScanner within an MR job, it will have the same
> behavior as anywhere else. You will be iterating through the whole table.
>
> > -- What is the best way to iterate through all the rows of a
> > particular region in a map task? This may be required to perform a
> > select operation in parallel.
>
> That is exactly what you are doing by using TableMap as the input to the
> MR job. Each map task is a scanner through a single region. You do not
> need to create a scanner within map(). There will be a call to map() for
> each row in that region, and a task for each region in the table.
>
> > Sorry for the long email. Many of the questions may be basic. I would
> > appreciate it if someone could answer them. Also, any suggestions for
> > implementing joins using MapReduce on HBase?
> >
> > Thanks
>
> Can you be more specific? HBase is not typically meant for joining data,
> though there are certainly plenty of valid cases for doing so. You may be
> able to get around it with better structuring of your data
> (denormalization is your friend); otherwise it's certainly possible to do
> with MR, depending on the specifics.
>
> Hope that helps.
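As a starting point for the sort-merge join asked about at the top, here is the merge step itself in plain Python, independent of HBase: both inputs are sorted by the join key (in an MR job, the shuffle/sort phase would effectively do this for you) and then merged in a single pass. The relation and column names are made up; in practice each input would come from a table scan.

```python
def sort_merge_join(left, right, key):
    """Inner join of two lists of dicts on the given key, sort-merge style."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the runs of matching keys.
            j0 = j
            while j < len(right) and right[j][key] == lk:
                out.append({**left[i], **right[j]})
                j += 1
            i += 1
            j = j0  # rewind in case the next left row has the same key
    return out

users = [{"uid": 2, "name": "bo"}, {"uid": 1, "name": "al"}]
posts = [{"uid": 1, "post": "hi"}, {"uid": 1, "post": "yo"}, {"uid": 3, "post": "zz"}]
print(sort_merge_join(users, posts, "uid"))
```

Since HBase already keeps rows sorted by row key, a join on row key could skip the sort entirely and just merge two scanners; a join on any other column would first need a sort (or a secondary index) over that column.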
> Don't hesitate to ask more questions; that's what the list is for. But
> don't forget to read the HBase Architecture docs and the other wiki pages,
> and to search the mailing list archives as well.
>
> Jonathan Gray

--
Nishant Khurana
Candidate for Masters in Engineering (Dec 2009)
Computer and Information Science
School of Engineering and Applied Science
University of Pennsylvania
