Hi,

Did you see the algorithm for the map-reduce join I have implemented? (I wrote it up at the end of my previous mail.) Any comments on it? I mean any improvements and so on.

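The algorithm from the previous mail is not reproduced in this message. Purely as a generic point of reference (not the implementation described there), a reduce-side map-reduce join can be sketched in plain Python. Everything here is illustrative: the function names, the `userid` join key and the sample rows are assumptions, and the shuffle is simulated in memory rather than run on Hadoop or HBase.

```python
from collections import defaultdict


def map_phase(table_name, rows, key_field):
    """Emit (join_key, (table_name, row)) so the reducer knows each row's source."""
    for row in rows:
        yield row[key_field], (table_name, row)


def shuffle(pairs):
    """Group mapper output by join key (what the framework does between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(tagged_rows, left_name, right_name):
    """For one join key, pair every left row with every right row (inner join)."""
    left = [row for tag, row in tagged_rows if tag == left_name]
    right = [row for tag, row in tagged_rows if tag == right_name]
    for l_row in left:
        for r_row in right:
            yield {**l_row, **r_row}


def reduce_side_join(left_name, left_rows, right_name, right_rows, key_field):
    pairs = list(map_phase(left_name, left_rows, key_field))
    pairs += list(map_phase(right_name, right_rows, key_field))
    joined = []
    for tagged_rows in shuffle(pairs).values():
        joined.extend(reduce_phase(tagged_rows, left_name, right_name))
    return joined


# Hypothetical sample data, loosely modelled on the User/Audit schema in the thread.
users = [{"userid": 1, "firstname": "Ann"}, {"userid": 2, "firstname": "Bob"}]
audits = [
    {"userid": 1, "actionid": 7},
    {"userid": 1, "actionid": 8},
    {"userid": 3, "actionid": 9},  # no matching user, dropped by the inner join
]
result = reduce_side_join("users", users, "audits", audits, "userid")
```

Keys that appear in only one table produce no output, which is the inner-join behaviour; an outer join would need the reducer to emit unmatched rows as well.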
On Thu, Jul 16, 2009 at 11:12 PM, Jonathan Gray <[email protected]> wrote:
> Hoberto, Bharath,
>
> Designing these kinds of queries efficiently in HBase means doing multiple
> round-trips, or denormalizing.
>
> That means degrading performance as the query complexity increases, or lots
> of data duplication and a complex write/update process.
>
> In your audit example, you provide the denormalizing solution. Store the
> fields you need with the data you are querying (details of the user/action
> in the audit table with the audit). If you have to update those details,
> then you have an extra expense on your write (and you introduce a potential
> synchronization issue without transactions).
>
> The choice about how to solve this really depends on the use case and what
> your requirements are. Can you ever afford to miss an update in one of the
> denormalized fields, even if it is extremely unlikely? You can build
> transactional layers on top or you can take a look at TransactionalHBase
> which attempts to do this in a more integrated way.
>
> You also talk about the other approach, running multiple queries in the
> application. As far as memory pressure in the app is concerned, that would
> really depend on the nature of the join. It's more an issue of how many
> joins you need to make, and if there's any way to reduce the number of
> queries/trips needed.
>
> If I am pulling the most recent 10 audits, and I need to join each with
> both the User and Actions table, then we're talking about 1 + 10 + 10 total
> queries. That's not so pretty, but if done in a distributed or threaded way
> may not be too bad. In the future, I expect more and more tools/frameworks
> available to aid in that process.
>
> Today, this is up to you.
>
> At Streamy, we solve these problems with layers above HBase. Some of them
> keep lots of stuff in-memory and do complex joins in-memory. Others
> coordinate the multiple queries to HBase, with or without an OCC-style
> transaction.
>
> My suggestion is to start denormalized. Build naive queries that do lots
> of round-trips. See what the performance is like under different conditions
> and then go from there. Perhaps Actions are generally immutable, their name
> never changes, so you could denormalize that field and cut out half of the
> total queries. Have a pool of threads that grab Users so you can do the
> join in parallel. Depending on your requirements, this might be sufficient.
> Otherwise look at more denormalization, or building a thin layer above.
>
> JG
>
>
> Mr Hoberto wrote:
>
>> I can think of two cases that I've been wondering about (I am very new,
>> and am still reading the docs & archives, so I apologize if this has
>> already been covered or if I use the wrong notation...I'm still learning).
>>
>> First case:
>>
>> Tracking audits. In the RDBMS world you'd have the following schema:
>>
>> User (userid, firstname, lastname)
>> Actions (actionid, actionname)
>> Audit (auditTime, userid, actionid)
>>
>> I think the answer in the HBase world is to denormalize the data...have a
>> structure such as:
>>
>> audits (auditid, audittime[timestamp], whowhat[family (firstName,
>> lastname, actionname)])
>>
>> The problem, as Bharath says, is what happens if a firstName or lastName
>> needs to be updated. Running a correction on all those denormalized rows
>> is going to be problematic.
>>
>> Alternatively, I suppose you could store the User and Actions tables
>> separately, and keep the audits structure in HBase storing only IDs, and
>> use the website's application layer to "merge" the different data sets
>> together for display on a page. The downside there is if you wind up with
>> a significant number of users or actions, it'll put a lot of memory
>> pressure on the app servers.
>>
>> Second case:
>>
>> Doing analysis on two time-series based data structures, such as a "PE
>> Ratio".
>>
>> In the RDBMS world you'd have two tables:
>>
>> Prices (ticker, date, price)
>> Earnings (ticker, date, earning)
>>
>> Again, I think the answer is denormalizing in the HBase world, with a
>> structure such as:
>>
>> PEs (date, timestamp, PERatio[family (ticker, PEvalue)])
>>
>> The problem here comes, again, with updates. For instance, what if you
>> only have earnings information available on an annual basis, and you've
>> come across a source that has it quarterly? You'll have to update 3/4 of
>> the rows in the denormalized table.
>>
>> Once again, I apologize for any sort of misunderstanding. I'm still
>> learning the concepts behind column stores and map/reduce.
>>
>> -hob
>>
>>
>> On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <[email protected]> wrote:
>>
>>> The answer is, it depends.
>>>
>>> What are the details of what you are trying to join? Is it just a simple
>>> 1-to-1 join, or 1-to-many or what? At a minimum, the join would require
>>> two round-trips. However, 0.20 can do simple queries in the 1-10ms
>>> time-range (closer to 1ms when the blocks are already cached).
>>>
>>> The comparison to an RDBMS cannot be made directly because a single-node
>>> RDBMS with a smallish table will be quite fast at simple index-based
>>> joins. I would guess that unloaded, single machine performance of this
>>> join operation would be much faster in an RDBMS.
>>>
>>> But if your table has millions or billions of rows, it's a different
>>> situation. HBase performance will stay nearly constant as your table
>>> increases, as long as you have the nodes to support your dataset and the
>>> load.
>>>
>>> What are your targets for time (sub 100ms? 10ms?), and what are the
>>> details of what you're joining?
>>>
>>> As far as code is concerned, there is not much to a simple join, so I'm
>>> not sure how helpful it would be.
>>> If you give some detail perhaps I can provide some pseudo-code for you.
>>>
>>> JG
>>>
>>>
>>> bharath vissapragada wrote:
>>>
>>>> JG, thanks for your reply,
>>>>
>>>> Actually I am trying to implement a realtime join of two tables on
>>>> HBase. I tried the idea of denormalizing the tables to avoid the joins,
>>>> but when we do that, updating the data is really difficult. I
>>>> understand that the features I am trying to implement are those of an
>>>> RDBMS and HBase is used for a different purpose. Even then I want
>>>> (rather, I would like to try) to store the data in HBase and implement
>>>> joins so that I could test its performance, and if it's effective (at
>>>> least on a large number of nodes), it may be of some help to me. I know
>>>> some people have already tried this. If anyone has already tried this,
>>>> can you just tell me how the results are? I mean, are they good when
>>>> compared to an RDBMS join on a single machine?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]> wrote:
>>>>
>>>>> Bharath,
>>>>>
>>>>> You need to outline what your actual requirements are if you want more
>>>>> help. Open-ended questions that just ask for code are usually not
>>>>> answered.
>>>>>
>>>>> What exactly are you trying to join? Does this join need to happen in
>>>>> "realtime" or is this part of a batch process?
>>>>>
>>>>> Could you denormalize your data to prevent needing the join at runtime?
>>>>>
>>>>> If you provide details about exactly what your data/schema is like (or
>>>>> a similar example if this is confidential), then many of us are more
>>>>> than happy to help you figure out what approach may work best.
>>>>>
>>>>> When working with HBase, figuring out how you want to pull your data
>>>>> out is key to how you want to put the data in.
>>>>>
>>>>> JG
>>>>>
>>>>>
>>>>> bharath vissapragada wrote:
>>>>>
>>>>>> Amandeep, can you tell me what kinds of joins you have implemented,
>>>>>> and which works the best (based on observation)? Can you show us the
>>>>>> source code (if possible)?
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>> On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I've been doing joins by writing my own MR jobs. That works best.
>>>>>>>
>>>>>>> Not tried cascading yet.
>>>>>>>
>>>>>>> -ak
>>>>>>>
>>>>>>> On 7/14/09, bharath vissapragada <[email protected]> wrote:
>>>>>>>
>>>>>>>> That's fine. I know that HBase has completely different usage
>>>>>>>> compared to SQL. But for my application there is some kind of
>>>>>>>> dependency involved among the tables, so I need to implement a
>>>>>>>> join. I wanted to know whether there is some kind of
>>>>>>>> implementation already.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> HBase != SQL.
>>>>>>>>>
>>>>>>>>> You might want map reduce or cascading.
>>>>>>>>>
>>>>>>>>> On Tue, Jul 14, 2009 at 9:56 PM, bharath
>>>>>>>>> vissapragada <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I want to join (similar to a relational database join) two
>>>>>>>>>> tables in HBase.
>>>>>>>>>>
>>>>>>>>>> Can anyone tell me whether it is already implemented in the
>>>>>>>>>> source?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance
>>>>>>>
>>>>>>> --
>>>>>>> Amandeep Khurana
>>>>>>> Computer Science Graduate Student
>>>>>>> University of California, Santa Cruz
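
As a rough illustration of the "1 + 10 + 10 queries" pattern and the thread-pool suggestion quoted above, here is a minimal Python sketch of an application-side join. It is only a sketch under stated assumptions: the three dictionaries and lists are hypothetical in-memory stand-ins for the User, Actions and Audit tables, and `get_recent_audits`, `get_user` and `get_action` are made-up placeholders for whatever single-row lookups an HBase client would actually provide.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the three tables (not real HBase calls).
USERS = {
    1: {"firstname": "Ann", "lastname": "Lee"},
    2: {"firstname": "Bob", "lastname": "Ray"},
}
ACTIONS = {7: {"actionname": "login"}, 8: {"actionname": "logout"}}
AUDITS = [  # newest first
    {"auditTime": 103, "userid": 2, "actionid": 7},
    {"auditTime": 102, "userid": 1, "actionid": 8},
    {"auditTime": 101, "userid": 1, "actionid": 7},
]

_lock = threading.Lock()
query_count = 0  # counts simulated round-trips


def _count_query():
    global query_count
    with _lock:
        query_count += 1


def get_recent_audits(n):
    _count_query()
    return AUDITS[:n]


def get_user(userid):
    _count_query()
    return USERS[userid]


def get_action(actionid):
    _count_query()
    return ACTIONS[actionid]


def recent_audits_joined(n=10, workers=8):
    audits = get_recent_audits(n)  # 1 query
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit both batches before consuming either, so the user and
        # action lookups (n + n queries) can overlap in the pool.
        user_iter = pool.map(get_user, [a["userid"] for a in audits])
        action_iter = pool.map(get_action, [a["actionid"] for a in audits])
        users = list(user_iter)
        actions = list(action_iter)
    return [{**a, **u, **act} for a, u, act in zip(audits, users, actions)]


joined = recent_audits_joined(n=10)
```

With the three sample audits this issues 1 + 3 + 3 lookups. If action names really are immutable and are stored with each audit row, the `get_action` batch disappears, which is the "cut out half of the total queries" point made in the thread.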
