See Nutch. See Nutch run.
http://en.wikipedia.org/wiki/Nutch
http://lucene.apache.org/nutch/

On 4/4/08 1:22 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have never used a Lucene index before, and I do not get how to build
> one with Hadoop MapReduce. Basically, what I was looking for is how to
> implement multilevel map/reduce for the problem I mentioned.
>
> On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <[EMAIL PROTECTED]> wrote:
>
>> You can build Lucene indexes using Hadoop Map/Reduce. See the index
>> contrib package in the trunk. Or is it still not something you are
>> looking for?
>>
>> Regards,
>> Ning
>>
>> On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
>>> No, currently my requirement is to solve this problem with Apache
>>> Hadoop. I am trying to build up this type of inverted index and then
>>> measure performance criteria with respect to others.
>>>
>>> Thanks,
>>>
>>> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>
>>>> Are you implementing this for instruction or production?
>>>>
>>>> If production, why not use Lucene?
>>>>
>>>> On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Amar, Theodore, Arun,
>>>>>
>>>>> Thanks for your reply. Actually I am new to Hadoop, so I can't
>>>>> figure out much. I have written the following code for the inverted
>>>>> index. This code maps each word from the document to its document id,
>>>>> ex: apple  file1, file123
>>>>> The main functions of the code are:
>>>>>
>>>>> public class HadoopProgram extends Configured implements Tool {
>>>>>   public static class MapClass extends MapReduceBase
>>>>>       implements Mapper<LongWritable, Text, Text, Text> {
>>>>>
>>>>>     private Text word = new Text();
>>>>>     private Text doc = new Text();
>>>>>     private long numRecords = 0;
>>>>>     private String inputFile;
>>>>>
>>>>>     public void configure(JobConf job) {
>>>>>       System.out.println("Configure function is called");
>>>>>       inputFile = job.get("map.input.file");
>>>>>       System.out.println("In conf the input file is " + inputFile);
>>>>>     }
>>>>>
>>>>>     public void map(LongWritable key, Text value,
>>>>>                     OutputCollector<Text, Text> output,
>>>>>                     Reporter reporter) throws IOException {
>>>>>       String line = value.toString();
>>>>>       StringTokenizer itr = new StringTokenizer(line);
>>>>>       doc.set(inputFile);
>>>>>       while (itr.hasMoreTokens()) {
>>>>>         word.set(itr.nextToken());
>>>>>         output.collect(word, doc);
>>>>>       }
>>>>>       if (++numRecords % 4 == 0) {
>>>>>         System.out.println("Finished processing of input file " + inputFile);
>>>>>       }
>>>>>     }
>>>>>   }
>>>>>
>>>>>   /**
>>>>>    * A reducer class that collects, for each word, the list of ids
>>>>>    * of the documents it appears in.
>>>>>    */
>>>>>   public static class Reduce extends MapReduceBase
>>>>>       implements Reducer<Text, Text, Text, DocIDs> {
>>>>>
>>>>>     // This works as K2, V2, K3, V3
>>>>>     public void reduce(Text key, Iterator<Text> values,
>>>>>                        OutputCollector<Text, DocIDs> output,
>>>>>                        Reporter reporter) throws IOException {
>>>>>       ArrayList<String> ids = new ArrayList<String>();
>>>>>       while (values.hasNext()) {
>>>>>         // toString() copies the bytes, so this is safe even though
>>>>>         // Hadoop reuses the Text object between iterations
>>>>>         ids.add(values.next().toString());
>>>>>       }
>>>>>       DocIDs dc = new DocIDs();
>>>>>       dc.setListdocs(ids);
>>>>>       output.collect(key, dc);
>>>>>     }
>>>>>   }
>>>>>
>>>>>   public int run(String[] args) throws Exception {
>>>>>     System.out.println("Run function is called");
>>>>>     JobConf conf = new JobConf(getConf(), HadoopProgram.class);
>>>>>     conf.setJobName("invertedindex");
>>>>>
>>>>>     // the keys are words (strings)
>>>>>     conf.setOutputKeyClass(Text.class);
>>>>>     // the map emits Text values, but the reduce emits DocIDs
>>>>>     conf.setMapOutputValueClass(Text.class);
>>>>>     conf.setOutputValueClass(DocIDs.class);
>>>>>
>>>>>     conf.setMapperClass(MapClass.class);
>>>>>     conf.setReducerClass(Reduce.class);
>>>>>
>>>>>     FileInputFormat.setInputPaths(conf, new Path(args[0]));
>>>>>     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>>>>     JobClient.runJob(conf);
>>>>>     return 0;
>>>>>   }
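The DocIDs value class that the reducer above emits is never shown in the
thread. A minimal sketch of what such a Writable might look like is below;
the setListdocs method and the list-of-strings field are taken from the
reducer code, everything else is an assumption:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class DocIDs implements Writable {

  // List of document ids; assumed from the setListdocs() call in Reduce.
  private ArrayList<String> listdocs = new ArrayList<String>();

  public void setListdocs(ArrayList<String> listdocs) {
    this.listdocs = listdocs;
  }

  // Serialize as a count followed by the ids themselves.
  public void write(DataOutput out) throws IOException {
    out.writeInt(listdocs.size());
    for (String id : listdocs) {
      Text.writeString(out, id);
    }
  }

  public void readFields(DataInput in) throws IOException {
    int n = in.readInt();
    listdocs = new ArrayList<String>(n);
    for (int i = 0; i < n; i++) {
      listdocs.add(Text.readString(in));
    }
  }

  // TextOutputFormat prints values via toString(), which would give the
  // "doc1, doc2" style output quoted below.
  public String toString() {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < listdocs.size(); i++) {
      if (i > 0) sb.append(", ");
      sb.append(listdocs.get(i));
    }
    return sb.toString();
  }
}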
>>>>> Now I am getting output from the reducer like:
>>>>> word  \root\test\test123, \root\test12
>>>>>
>>>>> In the next stage I want to remove stop words, scrub words, etc.,
>>>>> and also record the position of each word in its document. How would
>>>>> I apply multiple maps or multilevel map/reduce jobs programmatically?
>>>>> I guess I need to make another class or add some functions to this
>>>>> one? I am not able to figure it out. Any pointers for this type of
>>>>> problem?
>>>>>
>>>>> Thanks,
>>>>> Aayush
>>>>>
>>>>> On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> On Wed, 26 Mar 2008, Aayush Garg wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am developing a simple inverted index program on Hadoop. My map
>>>>>>> function has the output:
>>>>>>> <word, doc>
>>>>>>> and the reducer has:
>>>>>>> <word, list(docs)>
>>>>>>>
>>>>>>> Now I want to use one more mapreduce to remove stop and scrub
>>>>>>> words from this output.
>>>>>>
>>>>>> Use the distributed cache as Arun mentioned.
>>>>>>
>>>>>>> Also in the next stage I would like to have a short summary
>>>>>>> associated with every word. How should I design my program from
>>>>>>> this stage? I mean, how would I apply multiple mapreduce passes to
>>>>>>> this? What would be the better way to perform this?
>>>>>>
>>>>>> Whether to use a separate MR job depends on what exactly you mean
>>>>>> by summary. If it is like a window around the current word then you
>>>>>> can possibly do it in one go.
>>>>>> Amar
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Regards,
>>>
>>> --
>>> Aayush Garg,
>>> Phone: +41 76 482 240
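On the "multilevel map/reduce" question in the thread: with the JobConf
API of that era, chaining jobs programmatically just means configuring and
submitting one job after another inside run(), pointing each job's input
at the previous job's output directory. JobClient.runJob() blocks until a
job finishes, so the second job's input exists by the time it is
submitted. A minimal sketch of how run() in the HadoopProgram class above
could chain two jobs; ScrubMapClass and the path layout are placeholders,
and the extra imports would go at the top of the file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public int run(String[] args) throws Exception {
  Path input = new Path(args[0]);
  Path rawIndex = new Path(args[1] + "-raw");   // intermediate directory
  Path output = new Path(args[1]);

  // Job 1: build the raw inverted index, exactly as in the code above.
  JobConf indexJob = new JobConf(getConf(), HadoopProgram.class);
  indexJob.setJobName("invertedindex");
  indexJob.setOutputKeyClass(Text.class);
  indexJob.setMapOutputValueClass(Text.class);
  indexJob.setOutputValueClass(DocIDs.class);
  indexJob.setMapperClass(MapClass.class);
  indexJob.setReducerClass(Reduce.class);
  FileInputFormat.setInputPaths(indexJob, input);
  FileOutputFormat.setOutputPath(indexJob, rawIndex);
  JobClient.runJob(indexJob);          // blocks until job 1 completes

  // Job 2: read job 1's text output and drop stop/scrub words.
  // ScrubMapClass is a placeholder for a mapper that parses the
  // "word <TAB> docs" lines of job 1 and filters them.
  JobConf scrubJob = new JobConf(getConf(), HadoopProgram.class);
  scrubJob.setJobName("scrubindex");
  scrubJob.setOutputKeyClass(Text.class);
  scrubJob.setOutputValueClass(Text.class);
  scrubJob.setMapperClass(ScrubMapClass.class);
  scrubJob.setReducerClass(IdentityReducer.class);
  FileInputFormat.setInputPaths(scrubJob, rawIndex);
  FileOutputFormat.setOutputPath(scrubJob, output);
  JobClient.runJob(scrubJob);
  return 0;
}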
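On Amar and Arun's distributed-cache suggestion for stop words: the idea
is to ship a small stop-word file to every node and filter tokens in the
mapper, which avoids a separate MR pass just for filtering. A sketch under
assumptions: the class name, the stop-word file location, and its
one-word-per-line format are all placeholders, not anything from the
thread. The driver would register the file with something like
DistributedCache.addCacheFile(new URI("/user/aayush/stopwords.txt"), conf);
before submitting the job.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class StopWordMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Set<String> stopWords = new HashSet<String>();
  private Text word = new Text();
  private Text doc = new Text();
  private String inputFile;

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
    try {
      // Files registered in the driver via addCacheFile() are available
      // on every node as local paths.
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        stopWords.add(line.trim());   // assumes one stop word per line
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("cannot read stop word file", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    doc.set(inputFile);
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();
      if (!stopWords.contains(token)) {   // skip stop words entirely
        word.set(token);
        output.collect(word, doc);
      }
    }
  }
}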