JG,

Thanks for the tips!

Question: If I decide to use a combined key of timestamp + ID, do you
know if the query API has a way to do a partial search of the row key?
Or would I have to write an M/R job that does a quick parse of the key
and skips any row key that doesn't fall within my time range?

Thanks,
Ryan

On Sat, Dec 20, 2008 at 8:02 PM, <[email protected]> wrote:
> Ryan,
>
> The real question is how you want to query them.
>
> Do you want to look at them in chronological order? Do you want to be
> able to efficiently look at all requests for a particular user? All of a
> particular time period? Efficiently access a known request's (user +
> timestamp) serialized object? Or do you just want to see all of it all
> the time, i.e. your MR jobs will scan across everything?
>
> Thousands of columns should be no problem; I have hundreds of thousands
> in production in a single column family. There may be issues with
> millions, and you'll need to take into account the potential size of
> your objects. A row can only grow to the size of a region (256M by
> default, but configurable).
>
> Your suggested design is best suited for looking at all requests for a
> user, less so if you're interested in looking at things with respect to
> time. Though if you are only concerned with MR jobs, you typically have
> the entire table as input, so this design can be okay for looking only
> at certain time ranges.
>
> Another possibility might be to have row keys that are timestamp+user/ip.
> Your table would be ordered by time, so it would be easier to use
> scanners to efficiently seek to a stamp and read forward. I've not
> actually attempted an MR job with a startRow, so I'm not sure whether
> it's easy to do or not. But in the case that you end up with years'
> worth of data (thousands of regions in a table) and you want to process
> one day, it could end up being much more efficient not having to scan
> everything (thousands of unnecessary map tasks).
>
> I'm thinking out loud a bit; hopefully others chime in :)
>
> JG
>
> On Sat, December 20, 2008 3:34 pm, Ryan LeCompte wrote:
>> Hello all,
>>
>> I'd like a little advice on the best way to design a table in HBase.
>> Basically, I want to store Apache access log requests in HBase so that
>> I can query them efficiently. The problem is that each request may
>> have hundreds of parameters, and many requests can come in for the
>> same user/IP address.
>>
>> So, I was thinking of the following:
>>
>> 1 table called "requests" with a single column family called "request"
>>
>> Each row would have a key representing the user's IP address/unique
>> identifier; the columns would be timestamps of when the requests
>> occurred, and each cell value would be a serializable Java object
>> representing all the URL parameters of the Apache web server log
>> request at that specific time.
>>
>> Possible problems:
>>
>> 1) There may be thousands of requests that belong to a single unique
>> identifier (so there would be 1000s of columns)
>>
>> Any suggestions on how to represent this best? Is anyone doing
>> anything similar?
>>
>> FYI: I'm using Hadoop 0.19 and HBase-TRUNK.
>>
>> Thanks,
>> Ryan
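To make the timestamp + ID key idea concrete, here is a minimal sketch of
building and decoding such composite row keys, assuming the 0.19-era client
classes (org.apache.hadoop.hbase.util.Bytes; the RequestKeys class and its
method names are made up for illustration). The important detail is that
Bytes.toBytes(long) yields a fixed-width, big-endian encoding, so for
non-negative epoch timestamps the lexicographic row-key order HBase uses
matches chronological order:

    import org.apache.hadoop.hbase.util.Bytes;

    public class RequestKeys {
        // Composite row key: 8-byte big-endian timestamp, then the
        // user/ip identifier. Fixed-width stamps keep rows sorted
        // chronologically.
        public static byte[] rowKey(long stampMillis, String userId) {
            return Bytes.add(Bytes.toBytes(stampMillis),
                             Bytes.toBytes(userId));
        }

        // Recover the stamp: the leading 8 bytes of the key.
        public static long stampOf(byte[] rowKey) {
            return Bytes.toLong(rowKey);
        }
    }

Note the trade-off JG points out: time-leading keys make time-range scans
cheap but scatter a given user's requests across the table, while
user-leading keys (user + timestamp) do the opposite.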
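As for the partial-search question: with time-leading keys no special
partial-match call is needed; you open a scanner at the start stamp and
stop once a key passes the end stamp. A sketch, assuming the 0.19
HTable/Scanner interface (getScanner(columns, startRow) returning an
iterable of RowResult; exact signatures may differ on TRUNK):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Scanner;
    import org.apache.hadoop.hbase.io.RowResult;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeRangeScan {
        // Visit all requests with startStamp <= timestamp < endStamp.
        public static void scanRange(long startStamp, long endStamp)
                throws IOException {
            HTable table =
                new HTable(new HBaseConfiguration(), "requests");
            byte[][] columns = { Bytes.toBytes("request:") }; // family
            Scanner scanner =
                table.getScanner(columns, Bytes.toBytes(startStamp));
            try {
                for (RowResult row : scanner) {
                    // Leading 8 bytes of the key are the timestamp.
                    long stamp = Bytes.toLong(row.getRow());
                    if (stamp >= endStamp) {
                        break; // past the range; acts as a stop row
                    }
                    System.out.println(stamp); // process the row here
                }
            } finally {
                scanner.close();
            }
        }
    }

The same start-key trick is what would make the one-day-out-of-years case
cheap: the scanner seeks directly to the region containing the start stamp
instead of reading the whole table.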
