Hi Amar, Theodore, Arun,

Thanks for your replies. Actually I am new to Hadoop, so I can't figure out much on my own. I have written the following code for an inverted index. It maps each word in a document to that document's id, e.g. apple -> file1, file123. The main parts of the code are:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;

public class HadoopProgram extends Configured implements Tool {

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private Text word = new Text();
    private Text doc = new Text();
    private long numRecords = 0;
    private String inputFile;

    public void configure(JobConf job) {
      System.out.println("Configure function is called");
      inputFile = job.get("map.input.file");
      System.out.println("In configure, the input file is " + inputFile);
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      doc.set(inputFile);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, doc);   // emit <word, doc id>
      }
      if (++numRecords % 4 == 0) {
        System.out.println("Finished processing of input file " + inputFile);
      }
    }
  }

  /**
   * A reducer that collects, for each word, the list of document ids it
   * appears in. DocIDs is my custom Writable that holds this list.
   */
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, DocIDs> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, DocIDs> output,
                       Reporter reporter) throws IOException {
      ArrayList<String> ids = new ArrayList<String>();
      while (values.hasNext()) {
        // copy into a String, since the framework reuses the Text object
        ids.add(values.next().toString());
      }
      DocIDs dc = new DocIDs();
      dc.setListdocs(ids);
      output.collect(key, dc);
    }
  }

  public int run(String[] args) throws Exception {
    System.out.println("Run function is called");
    JobConf conf = new JobConf(getConf(), HadoopProgram.class);
    conf.setJobName("invertedindex");

    // the map output is <Text, Text>; the final reduce output is <Text, DocIDs>
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(DocIDs.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }
}

Now the reducer gives me output like:

word    \root\test\test123, \root\test12

In the next stage I want to remove stop words, scrub words, etc., and also record the position of each word in its document. How would I apply multiple map functions, or chain multiple MapReduce jobs, programmatically? I guess I need to write another class or add some functions to this one, but I am not able to figure it out. Any pointers for this type of problem? (I have put my two rough attempts below the quoted thread: one for chaining the jobs and one for loading a stop-word list via the DistributedCache.)

Thanks,
Aayush

On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > HI,
> > I am developing a simple inverted index program on Hadoop. My map
> > function has the output:
> > <word, doc>
> > and the reducer has:
> > <word, list(docs)>
> >
> > Now I want to use one more mapreduce to remove stop and scrub words from
> Use distributed cache as Arun mentioned.
> > this output. Also in the next stage I would like to have a short summary
> Whether to use a separate MR job depends on what exactly you mean by
> summary. If it is like a window around the current word then you can
> possibly do it in one go.
> Amar
> > associated with every word. How should I design my program from this
> > stage? I mean how would I apply multiple mapreduce to this? What would be
> > the better way to perform this?
> >
> > Thanks,
> > Regards,
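From the replies so far, here is roughly how I think the chaining should work; please correct me if I am wrong. This is a minimal sketch of run() with two jobs run back to back, where the second job reads the first job's output directory. StopWordMapper is just a placeholder name for the filtering mapper I still have to write, and I am assuming SequenceFile output/input so that job 2 can read the <Text, DocIDs> records back:

  public int run(String[] args) throws Exception {
    // Job 1: build the raw inverted index into an intermediate directory.
    JobConf indexJob = new JobConf(getConf(), HadoopProgram.class);
    indexJob.setJobName("inverted-index");
    indexJob.setMapperClass(MapClass.class);
    indexJob.setReducerClass(Reduce.class);
    indexJob.setMapOutputKeyClass(Text.class);
    indexJob.setMapOutputValueClass(Text.class);
    indexJob.setOutputKeyClass(Text.class);
    indexJob.setOutputValueClass(DocIDs.class);
    // SequenceFile keeps the <Text, DocIDs> pairs in binary form for job 2.
    indexJob.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(indexJob, new Path(args[0]));
    Path intermediate = new Path(args[1] + "-tmp");
    FileOutputFormat.setOutputPath(indexJob, intermediate);
    JobClient.runJob(indexJob);   // blocks until job 1 finishes

    // Job 2: filter stop/scrub words from job 1's output.
    JobConf filterJob = new JobConf(getConf(), HadoopProgram.class);
    filterJob.setJobName("stop-word-filter");
    filterJob.setInputFormat(SequenceFileInputFormat.class);
    filterJob.setMapperClass(StopWordMapper.class);     // placeholder, to be written
    filterJob.setReducerClass(IdentityReducer.class);   // org.apache.hadoop.mapred.lib
    filterJob.setOutputKeyClass(Text.class);
    filterJob.setOutputValueClass(DocIDs.class);
    FileInputFormat.setInputPaths(filterJob, intermediate);
    FileOutputFormat.setOutputPath(filterJob, new Path(args[1]));
    JobClient.runJob(filterJob);  // only starts after job 1 has completed
    return 0;
  }

Is calling JobClient.runJob() twice like this the right way to sequence the jobs, or is there a better mechanism?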
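And for the DistributedCache suggestion, this is how I plan to load the stop-word list inside the mapper; again just a sketch, and the HDFS path is made up. It needs java.net.URI, java.io.BufferedReader, java.io.FileReader, java.util.HashSet/Set, and org.apache.hadoop.filecache.DistributedCache:

  // In run(), before submitting the filter job:
  // DistributedCache.addCacheFile(new URI("/user/aayush/stopwords.txt"), filterJob);

  // Inside StopWordMapper:
  private Set<String> stopWords = new HashSet<String>();

  public void configure(JobConf job) {
    try {
      // getLocalCacheFiles returns the local paths of the files shipped to this node
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null && cached.length > 0) {
        BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
        String w;
        while ((w = in.readLine()) != null) {
          stopWords.add(w.trim().toLowerCase());
        }
        in.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("could not read the stop-word list", e);
    }
  }

  // and in map(): skip the token when stopWords.contains(token.toLowerCase())

Does this look like the intended use of the cache, or should the list be read from HDFS directly?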