Presumably a filter on a scanner runs as a filter in the first Map()
job, or is there something else going on?

Thanks

Tim

On Sun, Dec 21, 2008 at 3:16 AM, stack <[email protected]> wrote:
> Ryan LeCompte wrote:
>>
>> JG,
>>
>> Thanks for the tips!
>>
>> Question: If I decide to use a combined key of timestamp + ID, do you
>> know if the query API has a way to do a partial search of the row key?
>>
>
> HBase has a filter mechanism.  Filters run server-side and can match on
> row and/or column content.
>
> Scanners can be passed a start and end row.
>
> One approach would be to open a scanner over the time range you are
> interested in and pass in a filter that returns only the rows for a
> particular ID (or rows matching a particular regex), for example.
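>
> A minimal sketch of that approach, written against today's client API
> (Scan/ResultScanner and CompareOperator, which postdate the 0.19 API in
> this thread); the "requests" table and the <yyyyMMdd>_<id> key layout
> are assumptions for illustration:
>
>   import java.io.IOException;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.hbase.CompareOperator;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.TableName;
>   import org.apache.hadoop.hbase.client.Connection;
>   import org.apache.hadoop.hbase.client.ConnectionFactory;
>   import org.apache.hadoop.hbase.client.Result;
>   import org.apache.hadoop.hbase.client.ResultScanner;
>   import org.apache.hadoop.hbase.client.Scan;
>   import org.apache.hadoop.hbase.client.Table;
>   import org.apache.hadoop.hbase.filter.RegexStringComparator;
>   import org.apache.hadoop.hbase.filter.RowFilter;
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   public class TimeRangeScan {
>     public static void main(String[] args) throws IOException {
>       Configuration conf = HBaseConfiguration.create();
>       try (Connection conn = ConnectionFactory.createConnection(conf);
>            Table table = conn.getTable(TableName.valueOf("requests"))) {
>         // Start/stop rows bound the scan to one day of
>         // <yyyyMMdd>_<id> keys...
>         Scan scan = new Scan()
>             .withStartRow(Bytes.toBytes("20081220_"))
>             .withStopRow(Bytes.toBytes("20081221_"));
>         // ...and the row filter narrows it to a single ID,
>         // evaluated server-side.
>         scan.setFilter(new RowFilter(CompareOperator.EQUAL,
>             new RegexStringComparator("_user42$")));
>         try (ResultScanner rs = table.getScanner(scan)) {
>           for (Result r : rs) {
>             System.out.println(Bytes.toString(r.getRow()));
>           }
>         }
>       }
>     }
>   }
>
> Only the matching rows come back to the client; the filtering happens
> in the region servers.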
>
> St.Ack
>
>> Or would I have to write an M/R job that does a quick parse of the key
>> and skips any row key that doesn't fall within my time range?
>>
>> Thanks,
>> Ryan
>>
>>
>> On Sat, Dec 20, 2008 at 8:02 PM,  <[email protected]> wrote:
>>
>>>
>>> Ryan,
>>>
>>> The real question is how you want to query them.
>>>
>>> Do you want to look at them in chronological order?  Do you want to be
>>> able to efficiently look at all requests for a particular user?  All
>>> requests for a particular time period?  Efficiently access a known
>>> request's (user + timestamp) serialized object?  Or do you just want to
>>> see all of it every time, i.e. your MR jobs will scan across everything?
>>>
>>> Thousands of columns should be no problem; I have hundreds of thousands
>>> in production in a single row/family.  There may be issues with
>>> millions, and you'll need to take into account the potential size of
>>> your objects.  A row can only grow to the size of a region (256M by
>>> default, but configurable -- see below).
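>>>
>>> For reference, region size is governed by the
>>> hbase.hregion.max.filesize property in hbase-site.xml (value in bytes;
>>> the 256M below matches the default above, newer releases ship a much
>>> larger one):
>>>
>>>   <property>
>>>     <name>hbase.hregion.max.filesize</name>
>>>     <value>268435456</value>  <!-- 256M -->
>>>   </property>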
>>>
>>> Your suggested design is best suited to looking at all requests for a
>>> user, less so if you're interested in looking at things with respect to
>>> time.  Though if you are only concerned with MR jobs, you typically have
>>> the entire table as input, so this design can be okay for looking only
>>> at certain time ranges.
>>>
>>> Another possibility might be row keys of timestamp+user/ip.  Your table
>>> would be ordered by time, so it would be easy to use scanners to seek to
>>> a stamp and read forward.  I've not actually attempted an MR job with a
>>> startRow, so I'm not sure whether it's easy to do or not (a sketch
>>> follows below).  But in the case that you end up with years' worth of
>>> data (thousands of regions in a table) and you want to process one day,
>>> it could end up being much more efficient not having to scan everything
>>> (thousands of unnecessary map tasks).
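>>>
>>> A sketch of what that startRow MR job could look like with the current
>>> TableMapReduceUtil helpers, which accept a Scan (the table name, key
>>> layout, and no-op mapper are hypothetical):
>>>
>>>   import org.apache.hadoop.conf.Configuration;
>>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>>   import org.apache.hadoop.hbase.client.Result;
>>>   import org.apache.hadoop.hbase.client.Scan;
>>>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>>>   import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>>>   import org.apache.hadoop.hbase.mapreduce.TableMapper;
>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>   import org.apache.hadoop.io.NullWritable;
>>>   import org.apache.hadoop.mapreduce.Job;
>>>   import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
>>>
>>>   public class OneDayScanJob {
>>>     // One (row key, Result) pair per row in the bounded range.
>>>     static class DayMapper extends TableMapper<NullWritable, NullWritable> {
>>>       @Override
>>>       protected void map(ImmutableBytesWritable row, Result value,
>>>           Context ctx) {
>>>         // process one request row here
>>>       }
>>>     }
>>>
>>>     public static void main(String[] args) throws Exception {
>>>       Configuration conf = HBaseConfiguration.create();
>>>       Job job = Job.getInstance(conf, "requests-one-day");
>>>       job.setJarByClass(OneDayScanJob.class);
>>>       // Only regions overlapping the day's key range become map tasks,
>>>       // so years of other regions are never touched.
>>>       Scan scan = new Scan()
>>>           .withStartRow(Bytes.toBytes("20081220_"))
>>>           .withStopRow(Bytes.toBytes("20081221_"));
>>>       TableMapReduceUtil.initTableMapperJob("requests", scan,
>>>           DayMapper.class, NullWritable.class, NullWritable.class, job);
>>>       job.setNumReduceTasks(0);
>>>       job.setOutputFormatClass(NullOutputFormat.class);
>>>       System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>     }
>>>   }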
>>>
>>> I'm thinking out loud a bit, hopefully others chime in :)
>>>
>>> JG
>>>
>>> On Sat, December 20, 2008 3:34 pm, Ryan LeCompte wrote:
>>>
>>>>
>>>> Hello all,
>>>>
>>>>
>>>> I'd like a little advice on the best way to design a table in HBase.
>>>> Basically, I want to store Apache access log requests in HBase so that
>>>> I can query them efficiently. The problem is that each request may
>>>> have hundreds of parameters, and many requests can come in for the
>>>> same user/IP address.
>>>>
>>>> So, I was thinking of the following:
>>>>
>>>>
>>>> 1 table called "requests" and a single column family called "request"
>>>>
>>>>
>>>> Each row's key would represent the user's IP address/unique
>>>> identifier, each column would be a timestamp of when a request
>>>> occurred, and the cell value would be a serializable Java object
>>>> representing all the URL parameters of the Apache web server log
>>>> request at that specific time (sketched below).
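>>>>
>>>> Roughly, with today's client API (a sketch only; the table handle and
>>>> helper method are made up, the "request" family matches the proposal
>>>> above):
>>>>
>>>>   import java.io.IOException;
>>>>   import org.apache.hadoop.hbase.client.Put;
>>>>   import org.apache.hadoop.hbase.client.Table;
>>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>>
>>>>   public class RequestStore {
>>>>     /** One request: row = ip/id, qualifier = timestamp, value = bytes. */
>>>>     public static void store(Table requests, String ip, long timestamp,
>>>>         byte[] serializedParams) throws IOException {
>>>>       Put put = new Put(Bytes.toBytes(ip));            // row key
>>>>       put.addColumn(Bytes.toBytes("request"),          // column family
>>>>           Bytes.toBytes(Long.toString(timestamp)),     // qualifier
>>>>           serializedParams);                           // serialized object
>>>>       requests.put(put);
>>>>     }
>>>>   }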
>>>>
>>>> Possible problems:
>>>>
>>>>
>>>> 1) There may be thousands of requests that belong to a single unique
>>>> identifier (so there would be thousands of columns in one row)
>>>>
>>>> Any suggestions on how to represent this best? Is anyone doing
>>>> anything similar?
>>>>
>>>> FYI: I'm using Hadoop 0.19 and HBase-TRUNK.
>>>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>
>
