Presumably a filter in a scanner runs as a filter in the first Map() job, or is there something else going on?
Thanks,
Tim

On Sun, Dec 21, 2008 at 3:16 AM, stack <[email protected]> wrote:
> Ryan LeCompte wrote:
>>
>> JG,
>>
>> Thanks for the tips!
>>
>> Question: If I decide to use a combined key of timestamp + ID, do you
>> know if the query API has a way to do a partial search of the row key?
>>
>
> There is a filter mechanism in HBase. Filters run server-side, and they
> can filter on row and/or column content.
>
> Scanners can be passed a start and end row.
>
> One approach would be to start a scanner between the times you are
> interested in and pass in a filter that only returns the rows for a
> particular client ID (or rows that match a particular regex), for
> example.
>
> St.Ack
>
>> Or would I have to write a M/R job that does a quick parse of the key
>> and skips any row key that doesn't fit within my time range?
>>
>> Thanks,
>> Ryan
>>
>> On Sat, Dec 20, 2008 at 8:02 PM, <[email protected]> wrote:
>>>
>>> Ryan,
>>>
>>> The real question is how you want to query them.
>>>
>>> Do you want to look at them in chronological order? Do you want to be
>>> able to efficiently look at all requests for a particular user? All
>>> requests for a particular time period? Efficiently access a known
>>> request's (user + timestamp) serialized object? Or do you just want
>>> to see all of it all the time, i.e. your MR jobs will scan across
>>> everything?
>>>
>>> Thousands of columns should be no problem; I have hundreds of
>>> thousands in production in a single row and family. There may be
>>> issues with millions, and you'll need to take into account the
>>> potential size of your objects. A row can only grow to the size of a
>>> region (256M by default, but configurable).
>>>
>>> Your suggested design is best suited for looking at all requests for
>>> a user, less so if you're interested in looking at things with
>>> respect to time. Though if you are only concerned with MR jobs, you
>>> typically have the entire table as input, so this design can be okay
>>> for looking only at certain time ranges.
>>>
>>> Another possibility might be row keys of timestamp + user/ip. Your
>>> table would be ordered by time, so it would be easier to use scanners
>>> to efficiently seek to a stamp and look forward. I've not actually
>>> attempted an MR job with a startRow, so I'm not sure whether that's
>>> easy to do. But in the case that you end up with years' worth of data
>>> (thousands of regions in a table) and you want to process one day, it
>>> could end up being much more efficient not having to scan everything
>>> (thousands of unnecessary map tasks).
>>>
>>> I'm thinking out loud a bit; hopefully others chime in :)
>>>
>>> JG
>>>
>>> On Sat, December 20, 2008 3:34 pm, Ryan LeCompte wrote:
>>>>
>>>> Hello all,
>>>>
>>>> I'd like a little advice on the best way to design a table in HBase.
>>>> Basically, I want to store Apache access log requests in HBase so
>>>> that I can query them efficiently. The problem is that each request
>>>> may have hundreds of parameters, and many requests can come in for
>>>> the same user/ip address.
>>>>
>>>> So, I was thinking of the following:
>>>>
>>>> 1 table called "requests" and a single column family called
>>>> "request"
>>>>
>>>> Each row would have a key representing the user's ip address/unique
>>>> identifier; each column would be a timestamp of when the request
>>>> occurred, and the cell value would be a serializable Java object
>>>> representing all the url parameters of the Apache web server log
>>>> request at that specific time.
>>>>
>>>> Possible problems:
>>>>
>>>> 1) There may be thousands of requests that belong to a single unique
>>>> identifier (so there would be 1000s of columns)
>>>>
>>>> Any suggestions on how to represent this best? Is anyone doing
>>>> anything similar?
>>>>
>>>> FYI: I'm using Hadoop 0.19 and HBase-TRUNK.
>>>>
>>>> Thanks,
>>>> Ryan
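
To make stack's suggestion concrete, here is a minimal sketch of a
time-bounded scan with a server-side row filter. It uses the later HBase
client API rather than the 0.19/TRUNK one the thread targets (which had
HTable.getScanner and RowFilterInterface, with different signatures), and
it assumes a hypothetical row-key layout of "<20-digit zero-padded epoch
millis>/<userId>"; the "requests" table name comes from Ryan's design, and
"user42" is made up.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeUserScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("requests"))) {

      // Start/stop rows bound the scan to one day: 2008-12-20T00:00Z to
      // 2008-12-21T00:00Z as zero-padded epoch millis, matching the
      // assumed "<timestamp>/<userId>" key layout.
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes("00000001229731200000/"))
          .withStopRow(Bytes.toBytes("00000001229817600000/"));

      // Server-side filter: keep only rows whose key ends in this user
      // id, so non-matching rows never leave the region servers.
      scan.setFilter(new RowFilter(CompareOperator.EQUAL,
          new RegexStringComparator("/user42$")));

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toStringBinary(r.getRow()));
        }
      }
    }
  }
}

This also bears on the question at the top of the thread: the filter is
evaluated inside the region servers as the scan runs, so filtered-out rows
are never shipped to the client or to a map task at all.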

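For JG's timestamp + user/ip key layout, a sketch of building such keys
and writing one request row follows, again on the later client API. The
zero-padding matters because HBase sorts row keys as raw bytes, so a
fixed-width timestamp keeps byte order identical to chronological order.
The "requests" table and "request" family come from the thread; the key
format, the "path"/"status" qualifiers, and the RequestWriter class name
are illustrative assumptions. Storing one column per URL parameter instead
of one serialized Java blob is a variation on Ryan's design, not something
the thread settled on.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RequestWriter {
  // Left-pad the epoch-millis timestamp to 20 digits so that byte-wise
  // row-key order is chronological order (hypothetical layout).
  static byte[] rowKey(long epochMillis, String userOrIp) {
    return Bytes.toBytes(String.format("%020d/%s", epochMillis, userOrIp));
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("requests"))) {
      Put put = new Put(rowKey(System.currentTimeMillis(), "10.0.0.42"));
      // One column per parameter keeps individual values readable and
      // filterable server-side, unlike a single serialized object.
      put.addColumn(Bytes.toBytes("request"), Bytes.toBytes("path"),
          Bytes.toBytes("/index.html"));
      put.addColumn(Bytes.toBytes("request"), Bytes.toBytes("status"),
          Bytes.toBytes("200"));
      table.put(put);
    }
  }
}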
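On JG's open question about running an MR job with a startRow: with the
later org.apache.hadoop.hbase.mapreduce package this is straightforward,
because the job's input is described by a Scan, and only regions that
overlap the scan's key range get map tasks. The sketch below assumes the
same hypothetical key layout as above; note that this package postdates
0.19 (which used org.apache.hadoop.hbase.mapred), so treat it as an
illustration of the idea rather than what would have run on TRUNK then.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyScanJob {
  static class RequestMapper extends TableMapper<Text, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value,
        Context ctx) throws IOException, InterruptedException {
      // Emit the user/ip portion of the "<timestamp>/<user>" key.
      String key = Bytes.toString(row.get(), row.getOffset(),
          row.getLength());
      ctx.write(new Text(key.substring(key.indexOf('/') + 1)),
          new LongWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "requests-one-day");
    job.setJarByClass(DailyScanJob.class);

    // Restrict the input to one day's key range; regions outside the
    // range produce no map tasks, so years of data go untouched.
    Scan scan = new Scan()
        .withStartRow(Bytes.toBytes("00000001229731200000/"))
        .withStopRow(Bytes.toBytes("00000001229817600000/"));
    scan.setCaching(500);        // fetch rows from the server in batches
    scan.setCacheBlocks(false);  // don't churn the block cache from MR

    TableMapReduceUtil.initTableMapperJob("requests", scan,
        RequestMapper.class, Text.class, LongWritable.class, job);
    job.setNumReduceTasks(0);    // map-only sketch; add a reducer to sum
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}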