I wanted to first implement a simple 1-1 join and, if it shows some performance gain, extend it to 1-many by modifying the algorithm. I want to run this join in a realtime environment with tables of billions of rows, where the data also grows day by day. I chose this approach because the major problem with an RDBMS is scaling the data (I have read this on many websites, and it is what led to the development of key-value databases).

When I asked about comparing the performance of an RDBMS join with a join in HBase, I meant the case where the amount of data is huge (billions of rows). The difference is that with an RDBMS the data resides on a single machine and the processing is done on that same machine, whereas in the approach I want to implement the data resides in HBase and the join is performed on different machines simultaneously (as in map-reduce).

I have already implemented a map-reduce variant of hash join on Hadoop using just HDFS (the data is stored in HDFS files, not in HBase). The details of that join are as follows.
Consider a simple 1-1 join.

*Map phase:*
-> Read the input files (inner and outer relations), parse them, and emit each tuple keyed by the value of the column participating in the join.
-> After the map phase is over, all tuples with the same column value go to the same system. (We can employ simple methods to know which tuple is from the inner relation and which is from the outer relation.)

*Reduce phase:*
Since the tuples with matching column values are on the same system, they can be joined and written to disk.

Can you tell me what changes I have to make in order to implement a similar algorithm in HBase (which is column-oriented, so its I/O pattern differs greatly because the data of each column family is stored together)?

Please reply :) and by the way, sorry for such a long mail :)

On Thu, Jul 16, 2009 at 8:49 PM, Jonathan Gray <[email protected]> wrote:

> The answer is, it depends.
>
> What are the details of what you are trying to join? Is it just a simple
> 1-to-1 join, or 1-to-many or what? At a minimum, the join would require two
> round-trips. However, 0.20 can do simple queries in the 1-10ms time-range
> (closer to 1ms when the blocks are already cached).
>
> The comparison to an RDBMS cannot be made directly because a single-node
> RDBMS with a smallish table will be quite fast at simple index-based joins.
> I would guess that unloaded, single machine performance of this join
> operation would be much faster in an RDBMS.
>
> But if your table has millions or billions of rows, it's a different
> situation. HBase performance will stay nearly constant as your table
> increases, as long as you have the nodes to support your dataset and the
> load.
>
> What are your targets for time (sub 100ms? 10ms?), and what are the details
> of what you're joining?
>
> As far as code is concerned, there is not much to a simple join, so I'm not
> sure how helpful it would be. If you give some detail perhaps I can provide
> some pseudo-code for you.
>
> JG
>
>
> bharath vissapragada wrote:
>
>> JG thanks for ur reply,
>>
>> Actually iam trying to implement a realtime join of two tables on HBase .
>> Actually i tried the idea of denormalizing the tables to avoid the Joins ,
>> but when we do that Updating the data is really difficult . I understand
>> that the features i am trying to implement are that of a RDBMS and HBase is
>> used for a different purpose . Even then i want (rather i would like to try)
>> to store the data the data in HBase and implement Joins so that i could
>> test its performance and if its effective (atleast on large number of nodes)
>> , it maybe of somehelp to me . I know some ppl have already tried this . If
>> anyone of already tried this can you just tellme how the results are .. i
>> mean are they good , when compared to RDBMS join on a single machine ...
>>
>> Thanks
>>
>> On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]> wrote:
>>
>>> Bharath,
>>>
>>> You need to outline what your actual requirements are if you want more
>>> help. Open-ended questions that just ask for code are usually not
>>> answered.
>>>
>>> What exactly are you trying to join? Does this join need to happen in
>>> "realtime" or is this part of a batch process?
>>>
>>> Could you denormalize your data to prevent needing the join at runtime?
>>>
>>> If you provide details about exactly what your data/schema is like (or a
>>> similar example if this is confidential), then many of us are more than
>>> happy to help you figure out what approach my work best.
>>>
>>> When working with HBase, figuring out how you want to pull your data out
>>> is key to how you want to put the data in.
>>>
>>> JG
>>>
>>>
>>> bharath vissapragada wrote:
>>>
>>>> Amandeep , can you tell me what kinds of joins u have implemented ? and
>>>> which works the best (based on observation ).. Can u show us the source
>>>> code (if possible)
>>>>
>>>> Thanks in advance
>>>>
>>>> On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <[email protected]>
>>>> wrote:
>>>>
>>>>> I've been doing joins by writing my own MR jobs. That works best.
>>>>> Not tried cascading yet.
>>>>>
>>>>> -ak
>>>>>
>>>>> On 7/14/09, bharath vissapragada <[email protected]> wrote:
>>>>>
>>>>>> Thats fine .. I know that hbase has completely different usage
>>>>>> compared to SQL .. But for my application there is some kind of
>>>>>> dependency involved among the tables . So i need to implement a Join .
>>>>>> I wanted to know whether there is some kind of implementation already
>>>>>> ..
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> HBase != SQL.
>>>>>>>
>>>>>>> You might want map reduce or cascading.
>>>>>>>
>>>>>>> On Tue, Jul 14, 2009 at 9:56 PM, bharath
>>>>>>> vissapragada<[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all ,
>>>>>>>>
>>>>>>>> I want to join(similar to relational databases join) two tables in
>>>>>>>> HBase . Can anyone tell me whether it is already implemented in the
>>>>>>>> source !
>>>>>>>>
>>>>>>>> Thanks in Advance
>>>>>
>>>>> --
>>>>> Amandeep Khurana
>>>>> Computer Science Graduate Student
>>>>> University of California, Santa Cruz
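
P.S. To make the map and reduce phases described at the top of this mail concrete, here is a minimal sketch that simulates them in plain Python inside a single process. It does not touch Hadoop, HDFS or HBase. The function names (`map_phase`, `shuffle`, `reduce_phase`, `join`), the "outer"/"inner" tags and the sample rows are all assumptions made up for illustration, not taken from the actual job.

```python
from collections import defaultdict


def map_phase(relation_tag, rows, key_index):
    """Emit (join_key, (tag, row)) for every input row.

    The tag ("outer" or "inner") is a hypothetical marker that lets the
    reducer tell which relation a tuple came from.
    """
    for row in rows:
        yield row[key_index], (relation_tag, row)


def shuffle(pairs):
    """Group emitted pairs by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, tagged_row in pairs:
        groups[key].append(tagged_row)
    return groups


def reduce_phase(tagged_rows):
    """Join the outer and inner rows that arrived under one key."""
    outer = [row for tag, row in tagged_rows if tag == "outer"]
    inner = [row for tag, row in tagged_rows if tag == "inner"]
    for outer_row in outer:
        for inner_row in inner:
            yield outer_row + inner_row


def join(outer_rows, inner_rows, outer_key=0, inner_key=0):
    """Run map, shuffle and reduce in sequence and collect the joined rows."""
    pairs = list(map_phase("outer", outer_rows, outer_key))
    pairs += list(map_phase("inner", inner_rows, inner_key))
    result = []
    # Keys are sorted only to make the output order deterministic.
    for _key, tagged_rows in sorted(shuffle(pairs).items()):
        result.extend(reduce_phase(tagged_rows))
    return result


# Made-up sample relations, joined on their first column.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (4, "lamp")]
joined = join(users, orders)
```

Because the reducer pairs every outer row with every inner row that shares a key, the same sketch also covers the 1-many case without changes. Keys that appear in only one relation produce no output, so this behaves as an inner join.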
