Are you implementing this for instruction or production? If production, why not use Lucene?
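For a sense of how little code the Lucene route takes, here is a rough, untested sketch of indexing one document (this assumes a Lucene 2.x-era API; the index path, field names, and contents are placeholders, not part of the code in this thread):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class LuceneIndexSketch {
      public static void main(String[] args) throws Exception {
        // Create (or overwrite) an index under /tmp/index.
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

        // One Lucene Document per input file; "path" and "contents" are placeholder field names.
        Document doc = new Document();
        doc.add(new Field("path", "file1", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", "apple banana cherry", Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();   // merge segments; optional
        writer.close();
      }
    }

Lucene would also give you stop-word removal and term positions out of the box, which is most of what is being asked for below.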
On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:

> Hi Amar, Theodore, Arun,
>
> Thanks for your reply. Actually I am new to Hadoop, so I can't figure out much.
> I have written the following code for an inverted index. This code maps each
> word from the document to its document id, e.g.: apple -> file1, file123.
> The main functions of the code are:
>
> public class HadoopProgram extends Configured implements Tool {
>
>   public static class MapClass extends MapReduceBase
>       implements Mapper<LongWritable, Text, Text, Text> {
>
>     private Text word = new Text();
>     private Text doc = new Text();
>     private long numRecords = 0;
>     private String inputFile;
>
>     public void configure(JobConf job) {
>       System.out.println("Configure function is called");
>       inputFile = job.get("map.input.file");
>       System.out.println("In conf the input file is " + inputFile);
>     }
>
>     public void map(LongWritable key, Text value,
>                     OutputCollector<Text, Text> output,
>                     Reporter reporter) throws IOException {
>       String line = value.toString();
>       StringTokenizer itr = new StringTokenizer(line);
>       doc.set(inputFile);
>       // Emit <word, docId> for every token in the line.
>       while (itr.hasMoreTokens()) {
>         word.set(itr.nextToken());
>         output.collect(word, doc);
>       }
>       if (++numRecords % 4 == 0) {
>         System.out.println("Finished processing of input file " + inputFile);
>       }
>     }
>   }
>
>   /**
>    * A reducer class that collects the list of document ids for each word.
>    */
>   public static class Reduce extends MapReduceBase
>       implements Reducer<Text, Text, Text, DocIDs> {
>
>     // Types are K2, V2, K3, V3.
>     public void reduce(Text key, Iterator<Text> values,
>                        OutputCollector<Text, DocIDs> output,
>                        Reporter reporter) throws IOException {
>       ArrayList<String> ids = new ArrayList<String>();
>       while (values.hasNext()) {
>         ids.add(values.next().toString());
>       }
>       DocIDs dc = new DocIDs();
>       dc.setListdocs(ids);
>       output.collect(key, dc);
>     }
>   }
>
>   public int run(String[] args) throws Exception {
>     System.out.println("Run function is called");
>     JobConf conf = new JobConf(getConf(), HadoopProgram.class);
>     conf.setJobName("wordcount");
>
>     // The keys are words (strings); the final values are DocIDs,
>     // while the intermediate map output values are Text.
>     conf.setOutputKeyClass(Text.class);
>     conf.setOutputValueClass(DocIDs.class);
>     conf.setMapOutputValueClass(Text.class);
>
>     conf.setMapperClass(MapClass.class);
>     conf.setReducerClass(Reduce.class);
>
>     // Set input/output paths here, then submit the job:
>     JobClient.runJob(conf);
>     return 0;
>   }
> }
>
> Now I am getting output from the reducer as:
> word  \root\test\test123, \root\test12
>
> In the next stage I want to drop stop words, scrub words, etc., and also keep
> the position of each word in the document. How would I apply multiple maps or
> multilevel MapReduce jobs programmatically? I guess I need to make another
> class or add some functions to it? I am not able to figure it out.
> Any pointers for this type of problem?
>
> Thanks,
> Aayush
>
>
> On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
>
>> On Wed, 26 Mar 2008, Aayush Garg wrote:
>>
>>> Hi,
>>> I am developing a simple inverted index program with Hadoop. My map
>>> function has the output:
>>> <word, doc>
>>> and the reducer has:
>>> <word, list(docs)>
>>>
>>> Now I want to use one more MapReduce pass to remove stop and scrub words from
>>
>> Use the distributed cache as Arun mentioned.
>>
>>> this output. Also in the next stage I would like to have a short summary
>>
>> Whether to use a separate MR job depends on what exactly you mean by
>> summary. If it is like a window around the current word then you can
>> possibly do it in one go.
>> Amar
>>
>>> associated with every word. How should I design my program from this
>>> stage?
>>> I mean how would I apply multiple MapReduce jobs to this? What would be
>>> the better way to perform this?
>>>
>>> Thanks,
>>>
>>> Regards,
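Re the "multiple MapReduce jobs" question quoted above: with the old org.apache.hadoop.mapred API the usual pattern is simply to build one JobConf per stage in the driver, submit them in order, and point each stage's input at the previous stage's output; the stop-word list is shipped to the tasks with DistributedCache. A rough, untested sketch follows (0.17-or-later mapred API assumed; StopWordFilterMapper, the paths, and the stopwords URI are placeholders, not code from this thread):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Inside the Tool's run() method:
    public int run(String[] args) throws Exception {
      Path rawInput  = new Path(args[0]);   // original documents
      Path stage1Out = new Path(args[1]);   // <word, docs> from the inverted-index job
      Path stage2Out = new Path(args[2]);   // final, stop-word-filtered output

      // Stage 1: the inverted-index job already shown in this thread.
      // (Values are kept as Text here for brevity; with a custom type such as DocIDs
      // you would typically use SequenceFileOutputFormat on stage 1 and
      // SequenceFileInputFormat on stage 2 so the types survive between jobs.)
      JobConf indexJob = new JobConf(getConf(), HadoopProgram.class);
      indexJob.setJobName("inverted-index");
      indexJob.setMapperClass(MapClass.class);
      indexJob.setReducerClass(Reduce.class);
      indexJob.setOutputKeyClass(Text.class);
      indexJob.setOutputValueClass(Text.class);
      FileInputFormat.setInputPaths(indexJob, rawInput);
      FileOutputFormat.setOutputPath(indexJob, stage1Out);
      JobClient.runJob(indexJob);            // blocks until stage 1 finishes

      // Stage 2: a map-only job that drops stop words. StopWordFilterMapper is a
      // hypothetical mapper that loads the cached stop-word file in configure()
      // (via DistributedCache.getLocalCacheFiles) and skips matching keys.
      JobConf filterJob = new JobConf(getConf(), HadoopProgram.class);
      filterJob.setJobName("stop-word-filter");
      filterJob.setMapperClass(StopWordFilterMapper.class);
      filterJob.setNumReduceTasks(0);        // no reducer needed for plain filtering
      filterJob.setOutputKeyClass(Text.class);
      filterJob.setOutputValueClass(Text.class);
      FileInputFormat.setInputPaths(filterJob, stage1Out);   // chained: reads stage 1's output
      FileOutputFormat.setOutputPath(filterJob, stage2Out);

      // Ship the stop-word list to every task node.
      DistributedCache.addCacheFile(new URI("/user/aayush/stopwords.txt"), filterJob);

      JobClient.runJob(filterJob);
      return 0;
    }

A map-only second pass is enough if all it does is filter; anything that needs regrouping, such as adding word positions or building the per-word summary Amar mentions, would keep a reducer in that stage instead.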