You can build Lucene indexes using Hadoop Map/Reduce. See the index contrib package in the trunk. Or is that still not what you are looking for?
Regards,
Ning

On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
> No, currently my requirement is to solve this problem with Apache Hadoop. I am
> trying to build this type of inverted index and then measure performance
> criteria with respect to other approaches.
>
> Thanks,
>
> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > Are you implementing this for instruction or production?
> >
> > If production, why not use Lucene?
> >
> > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Amar, Theodore, Arun,
> > >
> > > Thanks for your reply. Actually I am new to Hadoop, so I can't figure out
> > > much. I have written the following code for the inverted index. This code
> > > maps each word from the document to its document id,
> > > e.g.: apple  file1 file123
> > > The main functions of the code are:
> > >
> > > public class HadoopProgram extends Configured implements Tool {
> > >   public static class MapClass extends MapReduceBase
> > >       implements Mapper<LongWritable, Text, Text, Text> {
> > >
> > >     private Text word = new Text();
> > >     private Text doc = new Text();
> > >     private long numRecords = 0;
> > >     private String inputFile;
> > >
> > >     public void configure(JobConf job) {
> > >       System.out.println("Configure function is called");
> > >       inputFile = job.get("map.input.file");
> > >       System.out.println("In conf the input file is " + inputFile);
> > >     }
> > >
> > >     public void map(LongWritable key, Text value,
> > >                     OutputCollector<Text, Text> output,
> > >                     Reporter reporter) throws IOException {
> > >       String line = value.toString();
> > >       StringTokenizer itr = new StringTokenizer(line);
> > >       doc.set(inputFile);
> > >       while (itr.hasMoreTokens()) {
> > >         word.set(itr.nextToken());
> > >         output.collect(word, doc);
> > >       }
> > >       if (++numRecords % 4 == 0) {
> > >         System.out.println("Finished processing of input file " + inputFile);
> > >       }
> > >     }
> > >   }
> > >
> > >   /**
> > >    * A reducer class that collects the list of document IDs for each word.
> > >    */
> > >   public static class Reduce extends MapReduceBase
> > >       implements Reducer<Text, Text, Text, DocIDs> {
> > >
> > >     // This works as K2, V2, K3, V3
> > >     public void reduce(Text key, Iterator<Text> values,
> > >                        OutputCollector<Text, DocIDs> output,
> > >                        Reporter reporter) throws IOException {
> > >       Text dummy = new Text();
> > >       ArrayList<String> IDs = new ArrayList<String>();
> > >       String str;
> > >
> > >       while (values.hasNext()) {
> > >         dummy = values.next();
> > >         str = dummy.toString();
> > >         IDs.add(str);
> > >       }
> > >       DocIDs dc = new DocIDs();
> > >       dc.setListdocs(IDs);
> > >       output.collect(key, dc);
> > >     }
> > >   }
> > >
> > >   public int run(String[] args) throws Exception {
> > >     System.out.println("Run function is called");
> > >     JobConf conf = new JobConf(getConf(), WordCount.class);
> > >     conf.setJobName("wordcount");
> > >
> > >     // the keys are words (strings)
> > >     conf.setOutputKeyClass(Text.class);
> > >     conf.setOutputValueClass(Text.class);
> > >
> > >     conf.setMapperClass(MapClass.class);
> > >     conf.setReducerClass(Reduce.class);
> > >   }
> > >
> > > Now I am getting output from the reducer as:
> > > word  \root\test\test123, \root\test12
> > >
> > > In the next stage I want to remove stop words, scrub words, etc., and also
> > > record things like the position of each word in the document. How would I
> > > apply multiple maps or multilevel MapReduce jobs programmatically? I guess
> > > I need to make another class or add some functions to it? I am not able to
> > > figure it out. Any pointers for this type of problem?
> > >
> > > Thanks,
> > > Aayush
> > >
> > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
> > >
> > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > >>
> > >>> Hi,
> > >>> I am developing a simple inverted index program with Hadoop. My map
> > >>> function has the output:
> > >>> <word, doc>
> > >>> and the reducer has:
> > >>> <word, list(docs)>
> > >>>
> > >>> Now I want to use one more MapReduce job to remove stop and scrub words from
> > >> Use distributed cache as Arun mentioned.
> > >>> this output. Also in the next stage I would like to have a short summary
> > >> Whether to use a separate MR job depends on what exactly you mean by
> > >> summary. If it's like a window around the current word then you can
> > >> possibly do it in one go.
> > >> Amar
> > >>> associated with every word. How should I design my program from this stage?
> > >>> I mean, how would I apply multiple MapReduce jobs to this? What would be
> > >>> the better way to perform this?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Regards,
> > >>
> >
>
> --
> Aayush Garg,
> Phone: +41 76 482 240
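On the question of running multiple MapReduce jobs programmatically: with the JobConf API used above, a common pattern is to configure two JobConf objects in run() and submit them one after the other with JobClient.runJob(), pointing the second job's input at the first job's output directory. A minimal sketch along those lines follows; it assumes the MapClass, Reduce and DocIDs classes from the quoted code, a hypothetical second-pass mapper named ScrubMapper, and an intermediate output directory derived from args[1]. Depending on the Hadoop release, the input/output paths may instead be set directly on JobConf with setInputPath()/setOutputPath().

// Inside the HadoopProgram driver class (extends Configured implements Tool).
// Needs: org.apache.hadoop.fs.Path, org.apache.hadoop.io.Text, and
// org.apache.hadoop.mapred.{JobConf, JobClient, FileInputFormat, FileOutputFormat}.
public int run(String[] args) throws Exception {
  Path input  = new Path(args[0]);
  Path pass1  = new Path(args[1] + "-pass1");    // intermediate directory (hypothetical layout)
  Path output = new Path(args[1]);

  // Pass 1: build the raw <word, list(docs)> index.
  JobConf first = new JobConf(getConf(), HadoopProgram.class);
  first.setJobName("inverted-index-pass1");
  first.setMapOutputKeyClass(Text.class);
  first.setMapOutputValueClass(Text.class);
  first.setOutputKeyClass(Text.class);
  first.setOutputValueClass(DocIDs.class);       // the Reduce class above emits DocIDs values
  first.setMapperClass(MapClass.class);
  first.setReducerClass(Reduce.class);
  FileInputFormat.setInputPaths(first, input);
  FileOutputFormat.setOutputPath(first, pass1);
  JobClient.runJob(first);                       // blocks until pass 1 finishes

  // Pass 2: read the pass-1 output and drop stop words, scrub tokens, etc.
  JobConf second = new JobConf(getConf(), HadoopProgram.class);
  second.setJobName("inverted-index-pass2");
  second.setMapOutputKeyClass(Text.class);
  second.setMapOutputValueClass(Text.class);
  second.setOutputKeyClass(Text.class);
  second.setOutputValueClass(DocIDs.class);
  second.setMapperClass(ScrubMapper.class);      // hypothetical second-pass mapper
  second.setReducerClass(Reduce.class);
  FileInputFormat.setInputPaths(second, pass1);  // chain: pass-1 output becomes pass-2 input
  FileOutputFormat.setOutputPath(second, output);
  JobClient.runJob(second);

  return 0;
}

One detail to check is the input format of the second job: if the first job writes its DocIDs values with the default text output format, the second job's mapper has to parse that text back, or the first job can write a SequenceFile instead.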
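For the stop-word removal that Amar and Arun suggested handling with the distributed cache, the stop-word list can be shipped to every task and the filtering done directly in the map phase, so no separate job is needed just for that step. Below is a minimal sketch under that assumption; the HDFS path /user/aayush/stopwords.txt and the name StopWordMapper are placeholders, and the class is essentially the MapClass above with a filter added. It needs the usual imports plus java.io.BufferedReader/FileReader, java.net.URI, java.util.HashSet/Set, and org.apache.hadoop.filecache.DistributedCache.

// Driver side, before submitting the job (the stop-word path is a placeholder):
//   DistributedCache.addCacheFile(new URI("/user/aayush/stopwords.txt"), conf);

public static class StopWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Set<String> stopWords = new HashSet<String>();
  private Text word = new Text();
  private Text doc = new Text();
  private String inputFile;

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
    try {
      // Each task reads the cached stop-word file from its local disk.
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null && cached.length > 0) {
        BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = in.readLine()) != null) {
          stopWords.add(line.trim().toLowerCase());
        }
        in.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not load stop-word list", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    doc.set(inputFile);
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken().toLowerCase();
      if (!stopWords.contains(token)) {          // emit only non-stop-words
        word.set(token);
        output.collect(word, doc);
      }
    }
  }
}

Word positions could be captured in the same pass: have map() emit the token's offset within the document alongside the document id, and let the reducer carry both through.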