Ryan,

The real question is how you want to query them.
Do you want to look at them in chronological order? Do you want to be able to efficiently look at all requests for a particular user? All requests for a particular time period? Efficiently access a known request's (user + timestamp) serialized object? Or do you just want to see all of it all the time, as when your MR jobs scan across everything?

1000s of columns should be no problem; I have hundreds of thousands in production in a single row-family. There may be issues with millions, and you'll need to take into account the potential size of your objects. A row can only grow to the size of a region (256MB by default, but configurable).

Your suggested design is best suited for looking at all requests for a user, less so if you're interested in looking at things with respect to time. Though if you are only concerned with MR jobs, you typically have the entire table as input, so this design can be okay even when you only care about certain time ranges.

Another possibility might be to have row keys that are timestamp+user/ip. Your table would be ordered by time, so it would be easier to use scanners to efficiently seek to a stamp and look forward. I've not actually attempted to do an MR job with a startRow, so I'm not sure if it's easy to do or not. But if you end up with years' worth of data (thousands of regions in a table) and you want to process 1 day, it could be much more efficient not to have to scan everything (thousands of unnecessary map tasks).

I'm thinking out loud a bit, hopefully others chime in :)

JG

On Sat, December 20, 2008 3:34 pm, Ryan LeCompte wrote:
> Hello all,
>
> I'd like a little advice on the best way to design a table in HBase.
> Basically, I want to store apache access log requests in HBase so that
> I can query them efficiently. The problem is that each request may
> have 100's of parameters and also many requests can come in for the same
> user/ip address.
>
> So, I was thinking of the following:
>
> 1 table called "requests" and a single column family called "request"
>
> Each row would have a key representing the user's ip address/unique
> identifier, and the columns would be a timestamp of when the request
> occurred, and the cell value would be a serializable Java object
> representing all the url parameters of the apache web server log request
> at that specific time.
>
> Possible problems:
>
> 1) There may be thousands of requests that belong to a single unique
> identifier (so there would be 1000s of columns)
>
> Any suggestions on how to represent this best? Is anyone doing
> anything similar?
>
> FYI: I'm using Hadoop 0.19 and HBase-TRUNK.
>
> Thanks,
> Ryan
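
A rough sketch of the timestamp+user/ip row key idea from the reply above, for illustration only. It does not use the HBase client at all: a sorted Python list stands in for the table, relying on the fact that HBase orders rows by their key bytes. The key layout (8-byte big-endian timestamp followed by the user/ip string) and the names `make_row_key` and `scan_range` are assumptions made up for this example, not anything from HBase itself.

```python
import bisect
import struct


def make_row_key(timestamp_ms: int, user: str) -> bytes:
    """Hypothetical layout: 8-byte big-endian timestamp, then the user/ip.

    Big-endian encoding makes byte order match numeric order for
    non-negative timestamps, so rows sort chronologically.
    """
    return struct.pack(">Q", timestamp_ms) + user.encode("utf-8")


def scan_range(sorted_keys, start_ms, stop_ms):
    """Return keys with start_ms <= timestamp < stop_ms.

    Mimics seeking a scanner to a start row and reading forward,
    instead of walking the whole table.
    """
    lo = bisect.bisect_left(sorted_keys, struct.pack(">Q", start_ms))
    hi = bisect.bisect_left(sorted_keys, struct.pack(">Q", stop_ms))
    return sorted_keys[lo:hi]


# Stand-in for the table: keys kept in sorted byte order.
keys = sorted([
    make_row_key(1229800000000, "10.0.0.2"),
    make_row_key(1229700000000, "10.0.0.1"),
    make_row_key(1229800000500, "10.0.0.1"),
    make_row_key(1229900000000, "10.0.0.3"),
])

# Pull out one day's worth of requests without touching the other rows.
day_start = 1229790000000
one_day = scan_range(keys, day_start, day_start + 24 * 60 * 60 * 1000)

for key in one_day:
    ts = struct.unpack(">Q", key[:8])[0]
    print(ts, key[8:].decode("utf-8"))
```

The tradeoff is the one described in the reply: time-range reads become a seek plus a short forward scan, but collecting all requests for one user now means touching every time range, since a user's rows are no longer adjacent.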
