JG,

Thanks for the tips!

Question: If I decide to use a combined key of timestamp + ID, do you
know if the query API has a way to do a partial search of the row key?
Or would I have to write an M/R job that does a quick parse of the key
and skips any row key that doesn't fall within my time range?

Thanks,
Ryan

On Sat, Dec 20, 2008 at 8:02 PM, <[email protected]> wrote:
> Ryan,
>
> The real question is how you want to query them.
>
> Do you want to look at them in chronological order? Do you want to be
> able to efficiently look at all requests for a particular user? All of a
> particular time period? Efficiently access a known request's (user +
> timestamp) serialized object? Or do you just want to see all of it all
> the time, i.e. your MR jobs will scan across everything?
>
> Thousands of columns should be no problem; I have hundreds of thousands
> in production in a single column family. There may be issues with
> millions, and you'll need to take into account the potential size of
> your objects. A row can only grow to the size of a region (256M by
> default, but configurable).
>
> Your suggested design is best suited for looking at all requests for a
> user, less so if you're interested in looking at things with respect to
> time. Though if you are only concerned with MR jobs, you typically have
> the entire table as input, so this design can be okay for looking only
> at certain time ranges.
>
> Another possibility might be to have row keys that are timestamp+user/ip.
> Your table would be ordered by time, so it would be easier to use
> scanners to efficiently seek to a stamp and read forward. I've not
> actually attempted an MR job with a startRow, so I'm not sure whether
> it's easy to do or not. But in the case that you end up with years'
> worth of data (thousands of regions in a table) and you want to process
> one day, it could end up being much more efficient not having to scan
> everything (thousands of unnecessary map tasks).
>
> I'm thinking out loud a bit; hopefully others chime in :)
>
> JG
>
> On Sat, December 20, 2008 3:34 pm, Ryan LeCompte wrote:
>> Hello all,
>>
>> I'd like a little advice on the best way to design a table in HBase.
>> Basically, I want to store Apache access log requests in HBase so that
>> I can query them efficiently. The problem is that each request may
>> have hundreds of parameters, and many requests can come in for the
>> same user/IP address.
>>
>> So, I was thinking of the following:
>>
>> 1 table called "requests" with a single column family called "request"
>>
>> Each row would have a key representing the user's IP address/unique
>> identifier; the columns would be timestamps of when the requests
>> occurred, and each cell value would be a serializable Java object
>> representing all the URL parameters of the Apache web server log
>> request at that specific time.
>>
>> Possible problems:
>>
>> 1) There may be thousands of requests that belong to a single unique
>> identifier (so there would be 1000s of columns)
>>
>> Any suggestions on how to represent this best? Is anyone doing
>> anything similar?
>>
>> FYI: I'm using Hadoop 0.19 and HBase-TRUNK.
>>
>> Thanks,
>> Ryan
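To make the timestamp + ID key idea concrete, here is a minimal sketch of
building and decoding such composite row keys, assuming the 0.19-era client
classes (org.apache.hadoop.hbase.util.Bytes; the RequestKeys class and its
method names are made up for illustration). The important detail is that
Bytes.toBytes(long) yields a fixed-width, big-endian encoding, so for
non-negative epoch timestamps the lexicographic row-key order HBase uses
matches chronological order:

    import org.apache.hadoop.hbase.util.Bytes;

    public class RequestKeys {
        // Composite row key: 8-byte big-endian timestamp, then the
        // user/ip identifier. Fixed-width stamps keep rows sorted
        // chronologically.
        public static byte[] rowKey(long stampMillis, String userId) {
            return Bytes.add(Bytes.toBytes(stampMillis),
                             Bytes.toBytes(userId));
        }

        // Recover the stamp: the leading 8 bytes of the key.
        public static long stampOf(byte[] rowKey) {
            return Bytes.toLong(rowKey);
        }
    }

Note the trade-off JG points out: time-leading keys make time-range scans
cheap but scatter a given user's requests across the table, while
user-leading keys (user + timestamp) do the opposite.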
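As for the partial-search question: with time-leading keys no special
partial-match call is needed; you open a scanner at the start stamp and
stop once a key passes the end stamp. A sketch, assuming the 0.19
HTable/Scanner interface (getScanner(columns, startRow) returning an
iterable of RowResult; exact signatures may differ on TRUNK):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Scanner;
    import org.apache.hadoop.hbase.io.RowResult;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeRangeScan {
        // Visit all requests with startStamp <= timestamp < endStamp.
        public static void scanRange(long startStamp, long endStamp)
                throws IOException {
            HTable table =
                new HTable(new HBaseConfiguration(), "requests");
            byte[][] columns = { Bytes.toBytes("request:") }; // family
            Scanner scanner =
                table.getScanner(columns, Bytes.toBytes(startStamp));
            try {
                for (RowResult row : scanner) {
                    // Leading 8 bytes of the key are the timestamp.
                    long stamp = Bytes.toLong(row.getRow());
                    if (stamp >= endStamp) {
                        break; // past the range; acts as a stop row
                    }
                    System.out.println(stamp); // process the row here
                }
            } finally {
                scanner.close();
            }
        }
    }

The same start-key trick is what would make the one-day-out-of-years case
cheap: the scanner seeks directly to the region containing the start stamp
instead of reading the whole table.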
