> From: jg...@facebook.com
> To: hbase-user@hadoop.apache.org
> Date: Thu, 1 Apr 2010 09:34:17 -0700
> Subject: RE: How to do "Group By" in HBase
>
> For 1/2, it seems that your row key design is ideal for those queries. You
> say it's inefficient because you need to scan the "whole session of data"
> containing hammer... but wouldn't you always have to do that unless you were
> doing some kind of summary/rollups? Even in a relational database you scan
> lots of stuff!
>
> I can't say I fully understand query #3. You select a single product and
> single store, but then group by product and store? Is that group by required
> if you just want a single SUM() result? Sorry my SQL is rusty :)
>
> If you're really just after counts of stuff, you could potentially keep
> counters rather than scanning all the original source data. Check out
> HTable.incrementColumnValue(). You could have a table with similar rows but
> that were rolled up on different dimensions.
>
> For example, if you wanted to roll-up counts for each product by each day,
> you could have row=product:day and then could have a single column or column
> for each store. In that column you would use incrementColumnValue to
> increment for each num(). Then you could do a single row query to get all of
> the counts for that product for that day, by store. If you wanted for a
> range of days, you could scan the days you need, but this would be more
> efficient than scanning the actual orders since each cell is just a counter.
Thank you, Jonathan. I see your suggestion here, which basically have
row=product:day and column=store. That's a great schema answering the above
question. However, I have a question following that (I did not describe the
whole picture my problem):
Stores belong to different zip codes; zip codes belong to different counties;
counties belong to different states. -- I want to be able to query each
zip code's sales in a range of dates, and I want to query each state's sales in
a range of dates. How can I get different level of aggregation efficiently?
(This seems to fall into OLAP cube problem -- moreover, my original problem
only holds 3-dimension in its key, but what if my problem holds N-dimension
where N>=4?)
Thanks,Sean
> Make sense?
>
> JG
>
> > -----Original Message-----
> > From: Sean [mailto:seanatpur...@hotmail.com]
> > Sent: Thursday, April 01, 2010 1:46 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: How to do "Group By" in HBase
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > I have the follow kind of data (a typical store sell record):
> > {product, date, store_name} --> number
> >
> > I understand that if I choose the following row key design, I will be
> > able to quickly GROUP BY store_name.
> >
> > row key -- product:date:store_name
> > column name -- number
> >
> > In other words, I can efficiently achieve the following logic (just a
> > HBase scan) -- adjacent scan.
> > 1) SELECT SUM(num) FROM sale_history_table where product="hammer" GROUP
> > BY product
> > 2) SELECT SUM(num) FROM sale_history_table where product="hammer",
> > date="12/04/2009" GROUP BY product date
> >
> > However, it's very inefficient to do the following thing because to
> > achieve this, I basically need to scan the whole session of data that
> > containing "hammer"
> >
> > 3) SELECT SUM(num) FROM sale_history_table where product="hammer",
> > store_name="SFO_AIRPORT" GROUP BY product store_name
> >
> > Can someone give me an advice on what I should design my HBase schema
> > if I choose to use native Hbase (I am thinking a second table may help
> > case 3, but have not come up with an idea)?
> >
> >
> >
> > ( I understand Zohmg is good at these kind of problem, but I'd rather
> > choose it as the last resort)
> >
> > Thanks,
> > Sean
> >
> > The New Busy is not the old busy. Search, chat and e-mail from your
> > inbox. Get started.
> > _________________________________________________________________
> > Hotmail is redefining busy with tools for the New Busy. Get more from
> > your inbox.
> > http://www.windowslive.com/campaign/thenewbusy?ocid=PID27925::T:WLMTAGL
> > :ON:WL:en-US:WM_HMP:032010_2
_________________________________________________________________
The New Busy think 9 to 5 is a cute idea. Combine multiple calendars with
Hotmail.
http://www.windowslive.com/campaign/thenewbusy?tile=multicalendar&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_5