Ryan,

The real question is how you want to query them.

Do you want to look at them in chronological order?  Do you want to be
able to efficiently look at all requests for a particular user?  All of
them, or only a particular time period?  Efficiently access a known
request's (user + timestamp) serialized object?  Or do you just want to
see all of it every time, i.e. your MR jobs will scan across everything?

1000s of columns should be no problem; I have hundreds of thousands in
production in a single row/family.  There may be issues with millions,
and you'll need to take into account the potential size of your objects.
A row can only grow to the size of a region (256MB by default, but
configurable).
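
If you want a quick back-of-the-envelope number for that, something like
the sketch below will tell you roughly how many bytes one request costs
once it's serialized (RequestParams here is just a made-up stand-in for
whatever object you actually store):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class RequestSizeCheck {

  // Made-up value object holding the url parameters of one request.
  static class RequestParams implements Serializable {
    private static final long serialVersionUID = 1L;
    Map<String, String> params = new HashMap<String, String>();
  }

  // Serialize with plain Java serialization and report the byte count,
  // so you can estimate how many cells fit in a row before you get
  // anywhere near the region size cap.
  static int serializedSize(RequestParams r) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(r);
    out.close();
    return bytes.size();
  }

  public static void main(String[] args) throws IOException {
    RequestParams r = new RequestParams();
    for (int i = 0; i < 100; i++) {
      r.params.put("param" + i, "value" + i);
    }
    System.out.println("one request ~= " + serializedSize(r) + " bytes");
  }
}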

Your suggested design is best suited for looking at all requests for a
user, less so if you're interested in looking at things with respect to
time.  Though if you are only concerned with MR jobs, you typically have
the entire table as input, so this design can be okay for looking only at
certain time ranges.
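
Writing under that layout would look roughly like the following.  This is
from memory against the 0.19 client API (BatchUpdate and friends), so
double-check the names against TRUNK, and the table/column naming is just
for illustration:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class StoreRequest {

  // One request under the user-keyed design:
  //   row key    = user/ip
  //   column     = request:<zero-padded timestamp>
  //   cell value = your serialized parameter object
  public static void store(String userOrIp, long timestamp,
      byte[] serializedRequest) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "requests");
    BatchUpdate update = new BatchUpdate(userOrIp);
    update.put("request:" + String.format("%019d", timestamp),
        serializedRequest);
    table.commit(update);
  }
}

Zero-padding the timestamp in the column name means the columns within a
row should sort in time order (they compare as bytes), which makes it a
little easier to walk one user's requests chronologically.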

Another possibility might be to have row keys that are timestamp+user/ip.
Your table would be ordered by time, so it would be easier to use scanners
to efficiently seek to a stamp and scan forward.  I've not actually
attempted an MR job with a startRow, so I'm not sure how easy it is.  But
in the case that you end up with years' worth of data (thousands of
regions in a table) and you want to process one day, it could end up
being much more efficient not having to scan everything (thousands of
unnecessary map tasks).
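
If you went that route, the seek-and-scan part would look something like
this (again from memory on the 0.19 API, and the fixed-width key format
is just one way to build the composite key):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTimeRange {

  // Row key = <zero-padded timestamp>/<user or ip>, e.g.
  // "0000001229820000000/1.2.3.4".  The padding matters because row
  // keys compare as raw bytes, not as numbers.
  static String rowKey(long timestamp, String userOrIp) {
    return String.format("%019d/%s", timestamp, userOrIp);
  }

  public static void scanRange(long startMs, long endMs) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "requests");
    Scanner scanner = table.getScanner(
        new String[] { "request:" },        // all columns in the family
        String.format("%019d", startMs));   // seek to first row at/after startMs
    try {
      RowResult row;
      while ((row = scanner.next()) != null) {
        String key = Bytes.toString(row.getRow());
        long ts = Long.parseLong(key.substring(0, 19));
        if (ts >= endMs) {
          break;                            // past the window, stop scanning
        }
        // ... deserialize and process the row's cells here ...
      }
    } finally {
      scanner.close();
    }
  }
}

That only covers the plain scanner case, though; it doesn't answer the
TableInputFormat/startRow question for MR.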

I'm thinking out loud a bit, hopefully others chime in :)

JG

On Sat, December 20, 2008 3:34 pm, Ryan LeCompte wrote:
> Hello all,
>
>
> I'd like a little advice on the best way to design a table in HBase.
> Basically, I want to store apache access log requests in HBase so that
> I can query them efficiently. The problem is that each request may
> have 100's of parameters and also many requests can come in for the same
> user/ip address.
>
> So, I was thinking of the following:
>
>
> 1 table called "requests" and a single column family called "request"
>
>
> Each row would have a key representing the user's ip address/unique
> identifier, each column would be the timestamp of when a request
> occurred, and the cell value would be a serializable Java object
> representing all the url parameters of the apache web server log request
> at that specific time.
>
> Possible problems:
>
>
> 1) There may be thousands of requests that belong to a single unique
> identifier (so there would be 1000s of columns)
>
> Any suggestions on how to represent this best? Is anyone doing
> anything similar?
>
> FYI: I'm using Hadoop 0.19 and HBase-TRUNK.
>
>
> Thanks,
> Ryan
>
>
>
