tim robertson wrote:
Presumably a filter in a scanner runs as a filter in the first Map()
job or is there something else going on?

When you set up an MR job, you can pass a filter (if using TableInputFormat -- see setTableFilter). The filter will run in all maps, not just the first.

Unfortunately, it's not possible to specify a start/stop row when running an MR job at the moment, not unless you write your own splitter. This is being looked into (HBASE-1075).

St.Ack

Thanks

Tim

On Sun, Dec 21, 2008 at 3:16 AM, stack <[email protected]> wrote:
Ryan LeCompte wrote:
JG,

Thanks for the tips!

Question: If I decide to use a combined key of timestamp + ID, do you
know if the query API has a way to do a partial search of the row key?

There is a filter mechanism in HBase.  Filters run server-side and filter on
row and/or column content.

Scanners can be passed a start and end row.

One approach would be to start a scanner between the times you are
interested in and pass in a filter that only returns the client rows for a
particular ID (or that match a particular regex), for example.
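The scan-plus-filter idea above can be sketched outside HBase with a plain sorted map standing in for a table (rows in HBase are kept sorted by key, much like a TreeMap). None of this is HBase API; the key format and "userA" id are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ScanFilterSketch {
    // Stand-in for an HBase table: rows sorted lexicographically by key.
    public static final NavigableMap<String, String> table = new TreeMap<>();

    // Scan the key range [startRow, stopRow) and keep only rows whose key
    // contains the given id -- analogous to handing a scanner a start/stop
    // row plus a server-side row filter.
    public static List<String> scan(String startRow, String stopRow, String id) {
        List<String> hits = new ArrayList<>();
        for (String key : table.subMap(startRow, true, stopRow, false).keySet()) {
            if (key.contains(id)) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        table.put("20081220-userA", "...");
        table.put("20081220-userB", "...");
        table.put("20081221-userA", "...");
        table.put("20081222-userA", "...");
        // Scan Dec 20-21 only, filtering to userA.
        System.out.println(scan("20081220", "20081222", "userA"));
        // → [20081220-userA, 20081221-userA]
    }
}
```

The point of the sketch: the start/stop rows prune the range cheaply (only keys in the range are touched), while the filter discards non-matching rows server-side, so neither the range pruning nor the filtering ships unwanted rows to the client.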

St.Ack

Or would I have to write an M/R job that does a quick parse of the key
and skips any row key that doesn't fit within my time range?

Thanks,
Ryan


On Sat, Dec 20, 2008 at 8:02 PM,  <[email protected]> wrote:

Ryan,

The real question is how you want to query them.

Do you want to look at them in chronological order?  Do you want to be
able to efficiently look at all requests for a particular user?  All requests
for a particular time period?  Efficiently access a known request's (user +
timestamp) serialized object?  Or do you just want to see all of it all the
time, as in your MR jobs will scan across everything?

1000s of columns should be no problem; I have hundreds of thousands in
production in a single row-family.  There may be issues with millions, and
you'll need to take into account the potential size of your objects.  A
row can only grow to the size of a region (256M by default, but
configurable).

Your suggested design is best suited for looking at all requests for a
user, less so if you're interested in looking at things with respect to
time.  Though if you are only concerned with MR jobs, you typically have
the entire table as input, so this design can be okay for looking only at
certain time ranges.

Another possibility might be to have row keys that are timestamp+user/ip.
Your table would be ordered by time, so it would be easier to use scanners
to efficiently seek to a stamp and look forward.  I've not actually
attempted an MR job with a startRow, so I'm not sure if it's easy to do or
not.  But in the case that you end up with years' worth of data (thousands
of regions in a table) and you want to process one day, it could end up
being much more efficient not having to scan everything (thousands of
unnecessary map tasks).
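The trick that makes a timestamp-first key like this work is keeping the timestamp portion fixed-width, so that lexicographic (byte) ordering matches chronological ordering. A plain-Java sketch of that idea (the exact key format and the IPs here are made up, not anything HBase prescribes):

```java
public class RowKeySketch {
    // Zero-pad the epoch-millis timestamp to a fixed 13 characters so that
    // string/byte comparison of keys agrees with time order. Without the
    // padding, "999-..." would sort after "1000-...".
    public static String makeKey(long timestampMillis, String userOrIp) {
        return String.format("%013d-%s", timestampMillis, userOrIp);
    }

    public static void main(String[] args) {
        String earlier = makeKey(1229500000000L, "10.0.0.1");
        String later   = makeKey(1229586400000L, "10.0.0.1");
        // true: earlier request sorts before the later one
        System.out.println(earlier.compareTo(later) < 0);
    }
}
```

With keys built this way, a scanner started at makeKey(dayStart, "") walks forward through exactly one day's requests, which is what makes the "process 1 day out of years of data" case cheap.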

I'm thinking out loud a bit, hopefully others chime in :)

JG

On Sat, December 20, 2008 3:34 pm, Ryan LeCompte wrote:

Hello all,


I'd like a little advice on the best way to design a table in HBase.
Basically, I want to store Apache access log requests in HBase so that
I can query them efficiently. The problem is that each request may
have 100's of parameters, and also many requests can come in for the same
user/ip address.

So, I was thinking of the following:


1 table called "requests" and a single column family called "request"


Each row would have a key representing the user's IP address/unique
identifier, each column would be a timestamp of when the request
occurred, and the cell value would be a serializable Java object
representing all the URL parameters of the Apache web server log request
at that specific time.
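That layout can be sketched in plain Java, with nested maps standing in for the row key and the timestamp-named columns (this is just a model of the proposed schema, not HBase API; the IPs and parameter names are made up):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class RequestsTableSketch {
    // requests[row key = ip][column qualifier = timestamp] -> request params
    // (in the real table the params would be a serialized Java object).
    public static final Map<String, NavigableMap<Long, Map<String, String>>> requests =
            new HashMap<>();

    public static void put(String ip, long timestamp, Map<String, String> params) {
        requests.computeIfAbsent(ip, k -> new TreeMap<>()).put(timestamp, params);
    }

    // Fetching everything for one user is a single-row read, which is what
    // this design is optimized for.
    public static NavigableMap<Long, Map<String, String>> forUser(String ip) {
        return requests.getOrDefault(ip, new TreeMap<>());
    }
}
```

The model also makes the trade-off visible: per-user access is one lookup, but answering a time-range question requires walking every row, which is the weakness discussed above.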

Possible problems:


1) There may be thousands of requests that belong to a single unique
identifier (so there would be 1000s of columns)

Any suggestions on how to represent this best? Is anyone doing
anything similar?

FYI: I'm using Hadoop 0.19 and HBase-TRUNK.


Thanks,
Ryan




