Are you currently being limited by network throughput? I wouldn't become obsessed with data locality until it becomes the bottleneck.

Even the naive implementation of this would not be entirely simple... and then what do you do if the regions on that node change during the course of the map (splits, reassignments, etc.)?
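To make the concern concrete, here is a minimal sketch of the naive lookup from the proposal quoted below, in plain Python rather than the HBase client API. The `Region` tuple, the host names, and the key ranges are all hypothetical. It takes a snapshot of the other table's regions and keeps those on the same host whose key range overlaps the current split.

```python
from typing import List, NamedTuple, Optional

class Region(NamedTuple):
    """Hypothetical stand-in for one region: half-open key range [start, end) on a host."""
    table: str
    start: str            # inclusive; "" means start of table
    end: Optional[str]    # exclusive; None means end of table
    host: str

def overlaps(a: Region, b: Region) -> bool:
    """True if the two half-open key ranges intersect."""
    a_starts_before_b_ends = b.end is None or a.start < b.end
    b_starts_before_a_ends = a.end is None or b.start < a.end
    return a_starts_before_b_ends and b_starts_before_a_ends

def local_regions_for_split(split: Region, other: List[Region]) -> List[Region]:
    """Regions of the other table on the split's host that overlap its key range."""
    return [r for r in other if r.host == split.host and overlaps(split, r)]

# Snapshot of table2's regions, taken once when the job is set up (made-up layout).
table2 = [
    Region("table2", "", "g", "node1"),
    Region("table2", "g", "p", "node2"),
    Region("table2", "p", None, "node1"),
]
split = Region("table1", "a", "k", "node1")

local = local_regions_for_split(split, table2)
print(local)
```

In this made-up layout only the first table2 region is both local and overlapping; keys from "g" up to "k" live on node2, so part of the join would still go over the network. The snapshot is also only valid for as long as no region splits or moves, which is exactly the problem raised above.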

I would imagine you'll have other things to optimize well before network throughput becomes an issue. And if you do go down the route of this kind of (potential) hyper-optimization, you'll need to be aware of the hardware you're using and the performance impact of different approaches. If you only have a single disk, then concurrent scans of two different tables can cause disk contention, etc...

Are you joining 2 tables by matching row key to row key? If so, then this sounds like 2 tables that should be 1 table with multiple families (that's really the value in multiple families... each family is really like a separate table, but they are easily joined together by row key).
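As a rough illustration of why one table with multiple families makes the join trivial, here is a toy in-memory model in plain Python. It is not the HBase API, and the row key, family names, and values are made up. Each row key maps to families, and each family maps qualifiers to values, so the "join" is a single row lookup.

```python
from collections import defaultdict

# Toy model only: row key -> family -> qualifier -> value.
table = defaultdict(lambda: defaultdict(dict))

def put(row, family, qualifier, value):
    """Store one cell under the given row key and family."""
    table[row][family][qualifier] = value

def get(row, families=None):
    """Return the requested families for one row (all families if None)."""
    data = table.get(row, {})  # .get() avoids creating an empty row on a miss
    return {f: dict(cols) for f, cols in data.items()
            if families is None or f in families}

# What would have been two tables becomes two families on the same row key.
put("user42", "profile", "name", "Ada")
put("user42", "orders", "last_order", "2009-10-15")

joined = get("user42")                  # both "tables" in one lookup
only_orders = get("user42", {"orders"})  # or restrict to one family
print(joined)
```

A scan that only needs one of the former tables can still restrict itself to that family, which is the sense in which each family behaves like a separate table.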

JG

bharath v wrote:
Kevin: What if I want to implement a join of 2 tables? Is there an
alternative to TableInputFormat (TIF), since it reads a single table at a
time? I thought of a solution, but I am not sure whether it would work.

Suppose we want to join table1 and table2, and we use TIF on table1. The
Map phase is as follows.

Map:

Suppose the TIF is reading region1 of table1. Then we can IN SOME WAY
get the start and end keys of the table2 regions (if any) on the
system where the map is being executed,
and read the table2 contents in the Map. This in some way preserves
DATA LOCALITY.

Is this feasible? Any comments?



On Fri, Oct 16, 2009 at 12:09 AM, Kevin Peterson <kpeter...@biz360.com> wrote:

On Thu, Oct 15, 2009 at 11:30 AM, Something Something <
luckyguy2...@yahoo.com> wrote:

1) I don't think TableInputFormat is useful in this case. Looks like it's
used for scanning columns from a single HTable.
2) TableMapReduceUtil - same problem. Seems like this works with just one
table.
3) JV recommended NLineInputFormat, but my parameters are not in a file.
They come from multiple files and are in memory.

I guess what I am looking for is something like... InMemoryInputFormat...
similar to FileInputFormat & DBInputFormat. There's no such class right
now.

Worse comes to worst, I can write the parameters into a flat file, and use
FileInputFormat - but that will slow down this process considerably. Is
there no other way?

So you need to pull input from multiple tables at once? Are you expecting
to do a join on these tables? If you explain what the data looks like, we'd
understand better. What are your tables, and what would you like to treat as
a single input record?

