Basic functionality suggestion

Handerson, Steven K. Tue, 22 Jul 2008 12:58:06 -0700

I have a suggestion for basic functionality.
Let me know if this should instead be posted somewhere else (Jira?  Wiki?).


I am new to map-reduce and Hadoop, so I apologize if this is too newbie 
and/or I've overlooked existing functionality (i.e. if there's already a way to 
do this).

In implementing Naïve Bayes (etc.) with Hadoop, I thought it would be nice to 
use map-reduce itself to do a "hash join" of the documents to be 
classified with the trained model (which can become quite large).  You can do 
this kind of thing in Pig by grouping two sets / streams / ids on a join 
element, but I think it makes sense to move this kind of functionality back 
into the map-reduce basic mechanism.

In other words, map-reduce is just a way to bucket processing, or join (in this 
case different) "database tables" or data types into one bucket on some join 
field (or computed value).  The reduce then has the problem / responsibility of 
determining what the appropriate join processing consists of.

So in our example, one set of data is (feature | classification, 
corresponding-weight-increment), and the other set of data is (feature | 
document-id).  The reduce (I'm fuzzy here) emits things like (document-id, 
classification | weight-increment ), which then gets reduced to (document-id, 
classification | weight) and/or (document-id | classification, weight) --> 
(document-id | best-classification)
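
A rough sketch of that flow in plain Python (no Hadoop involved), just to make the record shapes concrete.  The toy data, the "model"/"doc" tags and all function names are made up for illustration; the shuffle is simulated with a dictionary.

```python
from collections import defaultdict

# Hypothetical toy data; tuple layouts follow the two data sets described above.
MODEL = [  # (feature, classification, weight_increment)
    ("cheap", "spam", 2.0),
    ("cheap", "ham", 0.5),
    ("meeting", "ham", 1.0),
]
DOCS = [  # (feature, document_id)
    ("cheap", "doc1"),
    ("meeting", "doc1"),
    ("meeting", "doc2"),
]


def shuffle(pairs):
    """Group (key, value) pairs by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def join_reduce(feature, values):
    """Join step: pair every document holding the feature with every model weight."""
    weights = [v[1:] for v in values if v[0] == "model"]
    doc_ids = [v[1] for v in values if v[0] == "doc"]
    for doc_id in doc_ids:
        for classification, increment in weights:
            yield (doc_id, classification), increment


def classify(model, docs):
    # Map phase: tag each record with its source and key it by feature.
    mapped = [(f, ("model", c, w)) for f, c, w in model]
    mapped += [(f, ("doc", d)) for f, d in docs]
    # First reduce: join on feature.
    joined = [kv for f, vs in shuffle(mapped).items() for kv in join_reduce(f, vs)]
    # Second reduce: sum the increments per (document-id, classification).
    totals = {k: sum(vs) for k, vs in shuffle(joined).items()}
    # Third reduce: keep the highest-weighted classification per document.
    per_doc = shuffle((d, (w, c)) for (d, c), w in totals.items())
    return {d: max(scores)[1] for d, scores in per_doc.items()}
```

With this toy data, classify(MODEL, DOCS) labels doc1 as "spam" and doc2 as "ham".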

One thing one needs to support this kind of thing is the ability to process 
different data sets into the same map-reduce stream, but with different map 
interpretations.  I.e. there may be not just one set of files, but multiple 
sets of files, each with its own map "interpretation".  In a perfect language 
(or maybe even Java) one would define the output class as a union class, with 
some distinguishing feature / method so the reduce task would know which was 
which.  
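
As an illustration of the "union class" idea, here is a small Python sketch in which each input set is bound to its own map function and both emit values of one union type.  The class names, the tab-separated line formats and the INPUTS list are assumptions made up for the example, not an existing Hadoop API.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple, Union


@dataclass(frozen=True)
class ModelWeight:
    classification: str
    weight_increment: float


@dataclass(frozen=True)
class DocumentRef:
    document_id: str


# The "union class": a reduce task receives either kind of value.
JoinValue = Union[ModelWeight, DocumentRef]


def map_model_line(line: str) -> Iterator[Tuple[str, JoinValue]]:
    """Interpretation for model files: 'feature<TAB>classification<TAB>weight'."""
    feature, classification, weight = line.rstrip("\n").split("\t")
    yield feature, ModelWeight(classification, float(weight))


def map_document_line(line: str) -> Iterator[Tuple[str, JoinValue]]:
    """Interpretation for document files: 'document_id<TAB>feature feature ...'."""
    document_id, text = line.rstrip("\n").split("\t")
    for feature in text.split():
        yield feature, DocumentRef(document_id)


# Each set of input lines is paired with its own map "interpretation".
INPUTS = [
    (["cheap\tspam\t2.0"], map_model_line),
    (["doc1\tcheap meeting"], map_document_line),
]


def run_maps(inputs):
    """Feed every input set through its own mapper into one output stream."""
    for lines, mapper in inputs:
        for line in lines:
            yield from mapper(line)


def describe(value: JoinValue) -> str:
    """How a reduce task could tell which kind of record it is holding."""
    if isinstance(value, ModelWeight):
        return "model"
    return "document"
```

Both mappers key their output by feature, so the two kinds of record land in the same reduce bucket and the type of the value is the distinguishing feature.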

You might also want to provide ways to guarantee that some things appear to the 
reduce task before others, like that the model parameters show up before 
they're used (or vice versa).
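
One way such an ordering guarantee could look, sketched in plain Python: sort on (join key, record-kind rank) but group on the join key alone, so the reduce sees all model records for a feature before any document record.  The MODEL_FIRST ranking and the function names are hypothetical.

```python
from itertools import groupby

# Assumed ordering: model parameters sort ahead of document records.
MODEL_FIRST = {"model": 0, "doc": 1}


def sort_and_group(pairs):
    """pairs are (feature, (tag, payload)).

    Sort on (feature, tag rank) but group on feature only, so each group
    presents its model records first.
    """
    ordered = sorted(pairs, key=lambda kv: (kv[0], MODEL_FIRST[kv[1][0]]))
    for feature, group in groupby(ordered, key=lambda kv: kv[0]):
        yield feature, (value for _, value in group)


def streaming_join(feature, values):
    """Reduce that relies on the ordering: buffer weights, then stream documents."""
    weights = []
    for tag, payload in values:
        if tag == "model":
            weights.append(payload)  # payload is (classification, increment)
        else:
            for classification, increment in weights:
                yield payload, classification, increment  # payload is document-id
```

Because of the ordering, the reduce only has to hold the model weights for one feature in memory; the (possibly much larger) set of documents is streamed past them.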

I think the line between what's a map and a reduce task might be fuzzy as well, 
in that a reduce task might naturally emit something else as if it were a map 
task.  Is that currently possible, or does one have to define a kind of null-op 
map task in between?
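
To make the question concrete, here is a toy Python sketch (not Hadoop code) of two chained stages where the first reduce re-keys its output as a map would, so the second stage only needs a null-op map.  run_stage, identity_map and the reducer names are invented for the example.

```python
from collections import defaultdict


def identity_map(key, value):
    """The null-op map: pass each record through unchanged."""
    yield key, value


def run_stage(records, mapper, reducer):
    """One simulated map-reduce pass over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reducer(k, groups[k]))
    return out


def sum_and_rekey(key, values):
    """Reduce that also acts like a map: emits under a new key (document-id)."""
    document_id, classification = key
    yield document_id, (sum(values), classification)


def best_class(document_id, scored):
    """Final reduce: keep the classification with the largest total weight."""
    yield document_id, max(scored)[1]


def pipeline(increments):
    stage1 = run_stage(increments, identity_map, sum_and_rekey)
    return run_stage(stage1, identity_map, best_class)
```

Here the only job of the second stage's map is to hand the re-keyed records to the shuffle, which is exactly the in-between step the question is about.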

I guess basically I'm advocating watching what comes out of the Pig incubator, 
and thinking about how that functionality can be brought closer to the metal.

-- Steve
