> From: jg...@facebook.com
> To: hbase-user@hadoop.apache.org
> Date: Thu, 1 Apr 2010 09:34:17 -0700
> Subject: RE: How to do "Group By" in HBase
> 
> For 1/2, it seems that your row key design is ideal for those queries.  You 
> say it's inefficient because you need to scan the "whole session of data" 
> containing hammer... but wouldn't you always have to do that unless you were 
> doing some kind of summary/rollups?  Even in a relational database you scan 
> lots of stuff!
> 
> I can't say I fully understand query #3.  You select a single product and 
> single store, but then group by product and store?  Is that group by required 
> if you just want a single SUM() result?  Sorry my SQL is rusty :)
> 
> If you're really just after counts of stuff, you could potentially keep 
> counters rather than scanning all the original source data.  Check out 
> HTable.incrementColumnValue().  You could have a table with similar rows but 
> that were rolled up on different dimensions.
> 
> For example, if you wanted to roll-up counts for each product by each day, 
> you could have row=product:day and then could have a single column or column 
> for each store.  In that column you would use incrementColumnValue to 
> increment for each num().  Then you could do a single row query to get all of 
> the counts for that product for that day, by store.  If you wanted for a 
> range of days, you could scan the days you need, but this would be more 
> efficient than scanning the actual orders since each cell is just a counter.

Thank you, Jonathan. I see your suggestion here, which basically have 
row=product:day and column=store. That's a great schema answering the above 
question. However, I have a question following that (I did not describe the 
whole picture my problem): 
Stores belong to different zip codes; zip codes belong to different counties; 
counties belong to different states.      -- I want to be able to query each 
zip code's sales in a range of dates, and I want to query each state's sales in 
a range of dates. How can I get different level of aggregation efficiently? 
(This seems to fall into OLAP cube problem -- moreover, my original problem 
only holds 3-dimension in its key, but what if my problem holds N-dimension 
where N>=4?)
Thanks,Sean


> Make sense?
> 
> JG
> 
> > -----Original Message-----
> > From: Sean [mailto:seanatpur...@hotmail.com]
> > Sent: Thursday, April 01, 2010 1:46 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: How to do "Group By" in HBase
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > I have the follow kind of data (a typical store sell record):
> > {product, date, store_name} --> number
> > 
> > I understand that if I choose the following row key design, I will be
> > able to quickly GROUP BY store_name.
> > 
> > row key -- product:date:store_name
> > column name -- number
> > 
> > In other words, I can efficiently achieve the following logic (just a
> > HBase scan) -- adjacent scan.
> > 1) SELECT SUM(num) FROM sale_history_table where product="hammer" GROUP
> > BY product
> > 2) SELECT SUM(num) FROM sale_history_table where product="hammer",
> > date="12/04/2009" GROUP BY product date
> > 
> > However, it's very inefficient to do the following thing because to
> > achieve this, I basically need to scan the whole session of data that
> > containing "hammer"
> > 
> > 3) SELECT SUM(num) FROM sale_history_table where product="hammer",
> > store_name="SFO_AIRPORT" GROUP BY product store_name
> > 
> > Can someone give me an advice on what I should design my HBase schema
> > if I choose to use native Hbase (I am thinking a second table may help
> > case 3, but have not come up with an idea)?
> > 
> > 
> > 
> > ( I understand Zohmg is good at these kind of problem, but I'd rather
> > choose it as the last resort)
> > 
> > Thanks,
> > Sean
> > 
> > The New Busy is not the old busy. Search, chat and e-mail from your
> > inbox. Get started.
> > _________________________________________________________________
> > Hotmail is redefining busy with tools for the New Busy. Get more from
> > your inbox.
> > http://www.windowslive.com/campaign/thenewbusy?ocid=PID27925::T:WLMTAGL
> > :ON:WL:en-US:WM_HMP:032010_2
                                          
_________________________________________________________________
The New Busy think 9 to 5 is a cute idea. Combine multiple calendars with 
Hotmail. 
http://www.windowslive.com/campaign/thenewbusy?tile=multicalendar&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_5

Reply via email to