You can build Lucene indexes using Hadoop Map/Reduce. See the index
contrib package in the trunk. Or is that still not what you are looking for?
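
If it helps, the general idea is that each reduce task writes its <word,
doc-list> pairs into a task-local Lucene index, and the per-task index
directories are then copied into HDFS. Below is only a rough, untested sketch
of that idea against the plain Lucene 2.x API (class, field and path names are
made up), not the contrib package's actual interface:

  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class LocalLuceneIndexer {

    // Append one posting (word -> space-separated doc ids) to the index.
    static void addPosting(IndexWriter writer, String word, String docIds)
        throws IOException {
      Document d = new Document();
      d.add(new Field("word", word, Field.Store.YES, Field.Index.UN_TOKENIZED));
      d.add(new Field("docs", docIds, Field.Store.YES, Field.Index.NO));
      writer.addDocument(d);
    }

    public static void main(String[] args) throws IOException {
      // In a real job this would run inside the reducer, writing to a
      // task-local directory that is promoted to HDFS when the task closes.
      IndexWriter writer = new IndexWriter("part-index", new StandardAnalyzer(), true);
      addPosting(writer, "apple", "file1 file123");
      writer.optimize();
      writer.close();
    }
  }

Querying the result is then just a matter of opening the merged index with an
IndexSearcher.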

Regards,
Ning
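
P.S. If you would rather stay with plain Map/Reduce for the stop-word /
position pass, the usual pattern is simply to submit the second job after the
first one returns in your driver. Untested sketch, with made-up class and path
names:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class TwoPassDriver {
    public static void main(String[] args) throws Exception {
      // Pass 1: build <word, list-of-doc-ids> postings.
      JobConf first = new JobConf(TwoPassDriver.class);
      first.setJobName("build-postings");
      first.setInputPath(new Path(args[0]));
      first.setOutputPath(new Path("tmp/postings"));
      // first.setMapperClass(...); first.setReducerClass(...);
      JobClient.runJob(first);            // blocks until pass 1 finishes

      // Pass 2: read pass 1's output, drop stop words, add positions, etc.
      JobConf second = new JobConf(TwoPassDriver.class);
      second.setJobName("scrub-postings");
      second.setInputPath(new Path("tmp/postings"));
      second.setOutputPath(new Path(args[1]));
      // second.setMapperClass(...); second.setReducerClass(...);
      JobClient.runJob(second);
    }
  }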

On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
> No, currently my requirement is to solve this problem with Apache Hadoop
> itself. I am trying to build this type of inverted index and then measure its
> performance against other approaches.
>
> Thanks,
>
>
> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> >
> > Are you implementing this for instruction or production?
> >
> > If production, why not use Lucene?
> >
> >
> > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Amar, Theodore, Arun,
> > >
> > > Thanks for your reply. Actually I am new to Hadoop, so I can't figure out
> > > much. I have written the following code for an inverted index. It maps
> > > each word in a document to the ids of the documents it occurs in,
> > > e.g.: apple  file1 file123
> > > The main functions of the code are:
> > >
> > > public class HadoopProgram extends Configured implements Tool {
> > > public static class MapClass extends MapReduceBase
> > >     implements Mapper<LongWritable, Text, Text, Text> {
> > >
> > >     // map output: <word, file the word came from>
> > >     private Text word = new Text();
> > >     private Text doc = new Text();
> > >     private long numRecords=0;
> > >     private String inputFile;
> > >
> > >    public void configure(JobConf job){
> > >         System.out.println("Configure function is called");
> > >         inputFile = job.get("map.input.file");
> > >         System.out.println("In conf the input file is"+inputFile);
> > >     }
> > >
> > >
> > >     public void map(LongWritable key, Text value,
> > >                     OutputCollector<Text, Text> output,
> > >                     Reporter reporter) throws IOException {
> > >       String line = value.toString();
> > >       StringTokenizer itr = new StringTokenizer(line);
> > >       doc.set(inputFile);
> > >       while (itr.hasMoreTokens()) {
> > >         word.set(itr.nextToken());
> > >         output.collect(word,doc);
> > >       }
> > >       if (++numRecords % 4 == 0) {
> > >         System.out.println("Finished processing of input file " + inputFile);
> > >       }
> > >     }
> > >   }
> > >
> > >   /**
> > >    * A reducer class that collects the list of document ids for each word.
> > >    */
> > >   public static class Reduce extends MapReduceBase
> > >     implements Reducer<Text, Text, Text, DocIDs> {
> > >
> > >   // This works as K2, V2, K3, V3
> > >     public void reduce(Text key, Iterator<Text> values,
> > >                        OutputCollector<Text, DocIDs> output,
> > >                        Reporter reporter) throws IOException {
> > >       Text dummy = new Text();
> > >       ArrayList<String> IDs = new ArrayList<String>();
> > >       String str;
> > >
> > >       while (values.hasNext()) {
> > >          dummy = values.next();
> > >          str = dummy.toString();
> > >          IDs.add(str);
> > >        }
> > >        DocIDs dc = new DocIDs();
> > >        dc.setListdocs(IDs);
> > >       output.collect(key,dc);
> > >     }
> > >   }
> > >
> > >  public int run(String[] args) throws Exception {
> > >    System.out.println("Run function is called");
> > >     JobConf conf = new JobConf(getConf(), HadoopProgram.class);
> > >     conf.setJobName("wordcount");
> > >
> > >     // the keys are words (strings)
> > >     conf.setOutputKeyClass(Text.class);
> > >     // the map emits Text values, the reduce emits DocIDs values
> > >     conf.setMapOutputValueClass(Text.class);
> > >     conf.setOutputValueClass(DocIDs.class);
> > >
> > >     conf.setMapperClass(MapClass.class);
> > >     conf.setReducerClass(Reduce.class);
> > >
> > >     conf.setInputPath(new Path(args[0]));
> > >     conf.setOutputPath(new Path(args[1]));
> > >     JobClient.runJob(conf);
> > >     return 0;
> > > }
> > >
> > >
> > > Now I am getting output from the reducer like:
> > > word   \root\test\test123, \root\test12
> > >
> > > In the next stage I want to remove stop words, scrub words etc., and also
> > > record the position of each word in the document. How would I apply
> > > multiple maps or multilevel map reduce jobs programmatically? I guess I
> > > need to make another class or add some functions to it? I am not able to
> > > figure it out. Any pointers for this type of problem?
> > >
> > > Thanks,
> > > Aayush
> > >
> > >
> > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
> > >
> > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > >>
> > >>> HI,
> > >>> I am developing a simple inverted index program with Hadoop. My map
> > >>> function has the output:
> > >>> <word, doc>
> > >>> and the reducer has:
> > >>> <word, list(docs)>
> > >>>
> > >>> Now I want to use one more mapreduce to remove stop and scrub words
> > >>> from this output.
> > >> Use the distributed cache as Arun mentioned (rough sketch at the bottom
> > >> of this mail).
> > >>> Also in the next stage I would like to have a short summary
> > >>> associated with every word. How should I design my program from this
> > >>> stage? I mean how would I apply multiple mapreduce to this? What would
> > >>> be the better way to perform this?
> > >> Whether to use a separate MR job depends on what exactly you mean by
> > >> summary. If it's like a window around the current word then you can
> > >> possibly do it in one go.
> > >> Amar
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Regards,
> > >>> -
> > >>>
> > >>>
> > >>
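> > >> A rough, untested sketch of the stop-word filtering in the second job's
> > >> mapper (the stop-word file name and class names below are made up):
> > >>
> > >>   import java.io.BufferedReader;
> > >>   import java.io.FileReader;
> > >>   import java.io.IOException;
> > >>   import java.util.HashSet;
> > >>   import java.util.Set;
> > >>
> > >>   import org.apache.hadoop.filecache.DistributedCache;
> > >>   import org.apache.hadoop.fs.Path;
> > >>   import org.apache.hadoop.io.LongWritable;
> > >>   import org.apache.hadoop.io.Text;
> > >>   import org.apache.hadoop.mapred.JobConf;
> > >>   import org.apache.hadoop.mapred.MapReduceBase;
> > >>   import org.apache.hadoop.mapred.Mapper;
> > >>   import org.apache.hadoop.mapred.OutputCollector;
> > >>   import org.apache.hadoop.mapred.Reporter;
> > >>
> > >>   public class StopWordMapper extends MapReduceBase
> > >>       implements Mapper<LongWritable, Text, Text, Text> {
> > >>
> > >>     private final Set<String> stopWords = new HashSet<String>();
> > >>
> > >>     public void configure(JobConf job) {
> > >>       // The driver publishes the file with something like:
> > >>       //   DistributedCache.addCacheFile(new URI("/user/aayush/stopwords.txt"), job);
> > >>       try {
> > >>         Path[] cached = DistributedCache.getLocalCacheFiles(job);
> > >>         BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
> > >>         for (String w = in.readLine(); w != null; w = in.readLine()) {
> > >>           stopWords.add(w.trim());
> > >>         }
> > >>         in.close();
> > >>       } catch (IOException e) {
> > >>         throw new RuntimeException("could not read stop-word file", e);
> > >>       }
> > >>     }
> > >>
> > >>     public void map(LongWritable key, Text value,
> > >>                     OutputCollector<Text, Text> output,
> > >>                     Reporter reporter) throws IOException {
> > >>       // value is one "word <tab> doc-list" line from the first job's output
> > >>       String[] parts = value.toString().split("\t", 2);
> > >>       if (parts.length == 2 && !stopWords.contains(parts[0])) {
> > >>         output.collect(new Text(parts[0]), new Text(parts[1]));
> > >>       }
> > >>     }
> > >>   }
> > >>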
> >
> >
>
>
> --
> Aayush Garg,
> Phone: +41 76 482 240
>
