Ted Dunning wrote:
The easiest solution is to not worry too much about running an extra MR
step.
So:
- Run a first pass to get the counts. Use word count as the pattern, and
store the results in a file.
- Run the second pass. You can now read the hash-table from the file you
stored in pass 1. A rough driver for the two passes is sketched below.
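Something like this, driver-wise (untested sketch; the paths, the
"counts.path" property, and the pass-2 FilterMapper/FilterReducer classes
are all made up; TokenCountMapper and LongSumReducer ship with Hadoop):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class TwoPassDriver {
  public static void main(String[] args) throws Exception {
    // Pass 1: plain word count; the counts land in /tmp/counts.
    JobConf pass1 = new JobConf(TwoPassDriver.class);
    pass1.setJobName("counts");
    pass1.setMapperClass(TokenCountMapper.class);  // emits (token, 1)
    pass1.setReducerClass(LongSumReducer.class);   // sums the 1s
    pass1.setOutputKeyClass(Text.class);
    pass1.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(pass1, new Path("/data/input"));
    FileOutputFormat.setOutputPath(pass1, new Path("/tmp/counts"));
    JobClient.runJob(pass1);

    // Pass 2: the real job. Hand the counts file to the reducer through
    // a JobConf property so it can load the hash-table in configure().
    JobConf pass2 = new JobConf(TwoPassDriver.class);
    pass2.setJobName("filter");
    pass2.set("counts.path", "/tmp/counts/part-00000");
    pass2.setMapperClass(FilterMapper.class);      // hypothetical
    pass2.setReducerClass(FilterReducer.class);    // hypothetical
    FileInputFormat.setInputPaths(pass2, new Path("/data/input"));
    FileOutputFormat.setOutputPath(pass2, new Path("/data/output"));
    JobClient.runJob(pass2);
  }
}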
Another approach is to do the counting in your maps as specified and then,
before exiting, emit special records for each key to suppress. With the
correct sort and partition functions, you can make these killer records
appear first in the reduce input. Then, if your reducer sees the kill flag
at the front of the values, it can avoid processing any extra data.
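The moving parts would look roughly like this (untested sketch; the key
layout, "word\t0" for kill markers and "word\t1" for real data, and the
class names are all invented). The map emits data keyed as "word\t1" and,
from close(), a "word\t0" record for each word it wants suppressed:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Partition on the word alone, so the kill marker and the data for a
// given word always reach the same reducer.
class WordPartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) {}
  public int getPartition(Text key, Text value, int numPartitions) {
    String word = key.toString().split("\t", 2)[0];
    return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Group on the word alone as well (install a grouping comparator with
// job.setOutputValueGroupingComparator), and let the default sort run on
// the full key: "word\t0" sorts ahead of "word\t1", so a killed word
// shows up with the kill key first in its reduce group.
class KillAwareReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    if (key.toString().endsWith("\t0")) {
      return; // kill flag arrived first: skip this word entirely
    }
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}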
Ted,
Will this work for the case where the cutoff frequency/count requires a
global picture? I guess not.
In general, it is better not to try to communicate between map and reduce
except via the expected mechanisms.
On 4/16/08 1:33 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
We cannot read the HashMap in the configure method of the reducer, because
configure runs before reduce does, and the HashMap is only built inside
reduce. I need to eliminate rows from the HashMap once all the keys have
been read. Also, my concern is: if the dataset is large, will this HashMap
approach work at all?
On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
That design is fine.
You should read your map in the configure method of the reducer.
There is a MapFile format supported by Hadoop, but MapFiles tend to be
pretty slow. I usually find it better to just load my hash table by hand.
If you do this, you can use whatever format you like.
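For instance (untested sketch; "counts.path" and "cutoff" are made-up
JobConf properties the driver would set, and the file is assumed to be
TextOutputFormat's word<TAB>count lines):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class FilterReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  private final Map<String, Long> counts = new HashMap<String, Long>();
  private long cutoff;

  // Load the pass-1 counts by hand before any reduce() call runs.
  public void configure(JobConf job) {
    cutoff = job.getLong("cutoff", 1000);
    Path path = new Path(job.get("counts.path"));
    try {
      FileSystem fs = path.getFileSystem(job);
      BufferedReader in =
          new BufferedReader(new InputStreamReader(fs.open(path)));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t"); // word<TAB>count
        counts.put(parts[0], Long.valueOf(parts[1]));
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load counts", e);
    }
  }

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    Long count = counts.get(key.toString());
    if (count != null && count >= cutoff) {
      return; // drop high-frequency keys
    }
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new LongWritable(sum));
  }
}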
On 4/16/08 12:41 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
Hi,
The current structure of my program is:
public class Upper {

  static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) {
      // I count the frequency for each key, and I add the result to a
      // HashMap<Key, Value> instead of calling output.collect().
    }
  }

  void run() {
    runJob();
    // Only now is the HashMap built in the reduce function complete, so
    // I eliminate the top-frequency keys here.
    // Then I write this HashMap to a file, in a format such that the next
    // MapReduce job can take the HashMap's keys as the keys in its mapper.
    // How, and in which format, should I do this? Is this design and
    // approach OK?
  }

  public static void main(String[] args) {}
}
I hope my question is clear.
Thanks,
On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
Aayush Garg wrote:
Hi,
Are you sure that another MR is required for eliminating some rows? Can't
I just somehow eliminate them from main(), once I know which keys need to
be removed?
Can you provide some more details on how exactly you are filtering?
Amar