I haven't done MR with HBase yet, but when I do, I will have a use case close to what he wants.

I want to run MR where the input is the range between a known startKey and endKey. (His example is an extreme case where the range is very small.)

Is there an easy way to give a key range to the MR job, so that it doesn't have to walk through keys I know I don't want?
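
To make the question concrete, here is a minimal sketch of the pruning I have in mind: keep only the regions whose key span overlaps [startKey, endKey) and hand just those to the job. This is plain Java, not HBase's own code. The class KeyRangeSplits, the Region type and the region boundaries are all hypothetical; the only things taken from HBase are that row keys sort as unsigned bytes and that an empty key marks the open end of the table.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KeyRangeSplits {
    /** Hypothetical region covering [start, end); an empty key means open-ended on that side. */
    static final class Region {
        final byte[] start;
        final byte[] end;

        Region(String start, String end) {
            this.start = start.getBytes(StandardCharsets.UTF_8);
            this.end = end.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            return "[" + new String(start, StandardCharsets.UTF_8) + ", "
                    + new String(end, StandardCharsets.UTF_8) + ")";
        }
    }

    /** True if the region shares at least one possible key with [startKey, endKey). */
    static boolean overlaps(Region r, byte[] startKey, byte[] endKey) {
        // The region ends at or before the first wanted key.
        boolean regionBeforeRange =
                r.end.length > 0 && Arrays.compareUnsigned(r.end, startKey) <= 0;
        // The region starts at or after the end of the wanted range.
        boolean regionAfterRange =
                endKey.length > 0 && Arrays.compareUnsigned(r.start, endKey) >= 0;
        return !regionBeforeRange && !regionAfterRange;
    }

    /** Keeps only the regions a range-limited job would need to read. */
    static List<Region> prune(List<Region> regions, byte[] startKey, byte[] endKey) {
        List<Region> kept = new ArrayList<>();
        for (Region r : regions) {
            if (overlaps(r, startKey, endKey)) {
                kept.add(r);
            }
        }
        return kept;
    }

    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up boundaries for a table split into four regions.
        List<Region> regions = Arrays.asList(
                new Region("", "g"), new Region("g", "n"),
                new Region("n", "t"), new Region("t", ""));
        System.out.println(prune(regions, key("h"), key("p")));
    }
}
```

With pruning like this, the number of map tasks depends on how many regions the range touches rather than on the size of the table. Rows inside the first and last kept regions but outside the range would still need to be skipped by the scan itself.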


On 8/12/09 6:01 AM, Alex Spodinets wrote:
Ryan, Tim,

thanks for your response. It is fairly obvious that scanning through the entire
table is an option. But why would you scan if you know what you're looking
for? My research led me to the idea of changing the split algorithm in
TableInputBase (or its ancestor) to produce only one split, for the actual
location of the row. Do you think this will work?
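
A rough sketch of the lookup such a single-split algorithm would need: given the start key of every region and the server hosting it, find the one region that contains the row. This is plain Java under assumptions, not the HBase split code. The class SingleRowSplit, the host names and the region boundaries are hypothetical, and String ordering stands in for HBase's byte ordering, which only matches for plain ASCII keys.

```java
import java.util.Map;
import java.util.TreeMap;

class SingleRowSplit {
    /**
     * Returns the host of the region containing the row.
     * regionHosts maps each region's start key to its (hypothetical) host;
     * the first region is expected to use the empty string as its start key.
     */
    static String hostFor(TreeMap<String, String> regionHosts, String row) {
        // The owning region is the one with the greatest start key <= row.
        Map.Entry<String, String> owner = regionHosts.floorEntry(row);
        if (owner == null) {
            throw new IllegalArgumentException("no region covers row: " + row);
        }
        return owner.getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, String> regionHosts = new TreeMap<>();
        regionHosts.put("", "host-a");   // rows before "g"
        regionHosts.put("g", "host-b");  // rows from "g" up to "n"
        regionHosts.put("n", "host-c");  // rows from "n" onward
        System.out.println(hostFor(regionHosts, "kiwi"));
    }
}
```

The single split would then carry that host as its preferred location, which is only a hint: the scheduler may still run the task elsewhere.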

My intention is to explore another use of Map/Reduce: not as a
batch/parallel mass processing system, but as a way to run a single task and
track it, utilizing its ability to run code where the data is.
Any thoughts will be highly appreciated.

On Wed, Aug 12, 2009 at 5:40 AM, Ryan Rawson<[email protected]>  wrote:

You can write a map that processes the entire table, discarding
uninteresting rows, and the scheduler will make a best-effort attempt
at scheduling for locality. You will want to set up rack awareness to
ensure this is as effective as possible.

But how big are these rows? Rows that are bigger than the Xmx of a VM
don't really work right now (see: 0.21 roadmap). And for isolated
queries, locality really doesn't buy you as much as you think it might.
You save maybe 0.1 ms (ping time on a modern LAN) or less.

-ryan

On Tue, Aug 11, 2009 at 9:07 AM, Alex Spodinets<[email protected]>
wrote:
I do know the row. I want the MR job to run on the server closest to where
the data is. So this MR job will process only the data for this one row.

Thanks,
Alex.

On Tue, Aug 11, 2009 at 6:50 PM, stack<[email protected]>  wrote:

On Tue, Aug 11, 2009 at 7:35 AM, Alex Spodinets<[email protected]>
wrote:

Hello,

Is it possible to run a Map/Reduce job for only one row in a table, thus
skipping the unnecessary cycling through other rows instead of ignoring them
manually or via "skip mode"?

The idea behind it is to use Map/Reduce more like an application server
with data location awareness than as a batch/parallel processing system.

Please add more description.  I'm having trouble understanding what you
are asking.

+ If you know the row you want, just ask hbase -- you don't have to go
via MR.
+ MR is usually offline/batch operations but when you say things like
'application server' I get the sense you are talking about real-time
lookups?

Thanks,
St.Ack

