I have a suggestion for basic functionality. Let me know if this should instead be posted somewhere else (Jira? Wiki?).
I am new to map-reduce and Hadoop, so I apologize if this is too newbie and/or I've overlooked existing functionality (i.e. if there's already a way to do this).

In implementing Naïve Bayes (etc.) with Hadoop, I thought it would be nice to use map-reduce to do a "hash join" of the documents to be classified with the trained model (which can become quite large). You can do this kind of thing in Pig by grouping two sets / streams / ids on a join element, but I think it makes sense to move this kind of functionality back into the basic map-reduce mechanism. In other words, map-reduce is just a way to bucket processing, or to join (in this case different) "database tables" or data types into one bucket on some join field (or computed value). The reduce then has the responsibility of determining what the appropriate join processing consists of.

So in our example, one set of data is (feature | classification, corresponding-weight-increment), and the other set of data is (feature | document-id). The reduce (I'm fuzzy here) emits things like (document-id, classification | weight-increment), which then get reduced to (document-id, classification | weight) and/or (document-id | classification, weight) --> (document-id | best-classification).

One thing needed to support this is the ability to process different data sets into the same map-reduce stream, but with different map interpretations. I.e. there may be not just one set of files, but multiple sets of files, each with its own map "interpretation". In a perfect language (or maybe even Java) one would define the output class as a union class, with some distinguishing feature / method so the reduce task would know which was which. You might also want to provide ways to guarantee that some things appear to the reduce task before others, e.g. that the model parameters show up before they're used (or vice versa).
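To make the idea concrete, here is a rough in-memory sketch in Python of the join described above. It is not Hadoop code and uses no Hadoop API. All names (MODEL, DOC, map_model, map_doc, join_reduce, run_join, classify) and the sample data are made up for illustration. The tag plays the role of the "union class" discriminator, and sorting on (feature, tag) stands in for a guarantee that model parameters reach the reduce before the documents that use them.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical tags saying which data set a record came from.
# MODEL < DOC, so sorting on (feature, tag) delivers weights first.
MODEL, DOC = 0, 1


def map_model(record):
    """Map interpretation for model files: (feature | classification, weight-increment)."""
    feature, label, weight = record
    yield (feature, MODEL), (label, weight)


def map_doc(record):
    """Map interpretation for document files: (feature | document-id)."""
    doc_id, features = record
    for feature in features:
        yield (feature, DOC), doc_id


def join_reduce(feature, tagged_values):
    """Join one feature bucket; relies on MODEL values arriving before DOC values."""
    weights = []
    for tag, value in tagged_values:
        if tag == MODEL:
            weights.append(value)
        else:
            doc_id = value
            for label, weight in weights:
                yield (doc_id, label), weight


def run_join(model_records, doc_records):
    """Simulate map, shuffle/sort, and reduce for the join stage."""
    emitted = []
    for record in model_records:
        emitted.extend(map_model(record))
    for record in doc_records:
        emitted.extend(map_doc(record))
    # Stand-in for the shuffle: sort on the composite key (feature, tag).
    emitted.sort(key=lambda kv: kv[0])
    joined = []
    for feature, group in groupby(emitted, key=lambda kv: kv[0][0]):
        tagged = ((key[1], value) for key, value in group)
        joined.extend(join_reduce(feature, tagged))
    return joined


def classify(increments):
    """Second stage: sum weights per (document-id, classification), keep the best."""
    totals = defaultdict(float)
    for (doc_id, label), weight in increments:
        totals[(doc_id, label)] += weight
    best = {}
    for (doc_id, label), total in totals.items():
        if doc_id not in best or total > best[doc_id][1]:
            best[doc_id] = (label, total)
    return {doc_id: label for doc_id, (label, _) in best.items()}


# Made-up sample data, purely for illustration.
model = [
    ("cheap", "spam", 2.0), ("cheap", "ham", 0.5),
    ("meeting", "ham", 1.5), ("meeting", "spam", 0.25),
    ("agenda", "ham", 1.0),
]
docs = [("d1", ["cheap", "zzz"]), ("d2", ["meeting", "agenda"])]
result = classify(run_join(model, docs))
```

In this sketch a feature with no model entry (such as "zzz") simply produces no output, which is the behaviour of an inner join. The second stage is written as a separate function because its input is keyed differently from the join stage, which is also why it would need its own pass over the data.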
I think the line between what's a map task and what's a reduce task might be fuzzy as well, in that a reduce task might naturally emit something else as if it were a map task. Is that currently possible, or does one have to define a kind of null-op map task in between? I guess basically I'm advocating watching what comes out of the Pig incubator, and thinking about how that functionality can be brought closer to the metal.

-- Steve