Hi Amar, Theodore, Arun,

Thanks for your reply. Actually I am new to Hadoop, so I can't figure out
much. I have written the following code for an inverted index. It maps each
word in a document to that document's ID,
e.g.: apple -> file1, file123
The main functions of the code are:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;

public class HadoopProgram extends Configured implements Tool {
  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

    private Text word = new Text();
    private Text doc = new Text();
    private long numRecords = 0;
    private String inputFile;

    public void configure(JobConf job) {
      System.out.println("Configure function is called");
      // Name of the file backing this task's input split, set by the framework.
      inputFile = job.get("map.input.file");
      System.out.println("In configure, the input file is " + inputFile);
    }


    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      doc.set(inputFile);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word,doc);
      }
      // Log progress periodically (this fires every 4 records,
      // not only when the input file is finished).
      if (++numRecords % 4 == 0) {
        System.out.println("Processed " + numRecords + " records from " + inputFile);
      }
    }
  }

  /**
   * A reducer class that emits, for each word, the list of
   * document IDs the word appears in.
   */
  public static class Reduce extends MapReduceBase
    implements Reducer<Text, Text, Text, DocIDs> {

    // Types: K2 = Text (word), V2 = Text (doc), K3 = Text, V3 = DocIDs
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, DocIDs> output,
                       Reporter reporter) throws IOException {
      ArrayList<String> ids = new ArrayList<String>();

      // Hadoop reuses the Text instance returned by the iterator,
      // so copy its contents out before storing.
      while (values.hasNext()) {
        ids.add(values.next().toString());
      }
      DocIDs dc = new DocIDs();
      dc.setListdocs(ids);
      output.collect(key, dc);
    }
  }

  public int run(String[] args) throws Exception {
    System.out.println("Run function is called");
    JobConf conf = new JobConf(getConf(), HadoopProgram.class);
    conf.setJobName("invertedindex");

    // The map emits <Text, Text> but the reduce emits <Text, DocIDs>,
    // so the map output value class must be set separately.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(DocIDs.class);
    conf.setMapOutputValueClass(Text.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
    return 0;
  }
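
Since DocIDs is a custom value type, it has to implement Writable so Hadoop
can serialize it. A minimal sketch of what that class might look like,
assuming it only wraps the list of document paths:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.Writable;

public class DocIDs implements Writable {
  private ArrayList<String> docs = new ArrayList<String>();

  public void setListdocs(ArrayList<String> docs) {
    this.docs = docs;
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(docs.size());
    for (String d : docs) {
      out.writeUTF(d);
    }
  }

  public void readFields(DataInput in) throws IOException {
    int n = in.readInt();
    docs = new ArrayList<String>(n);
    for (int i = 0; i < n; i++) {
      docs.add(in.readUTF());
    }
  }

  // TextOutputFormat calls toString() when writing the value.
  public String toString() {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < docs.size(); i++) {
      if (i > 0) sb.append(", ");
      sb.append(docs.get(i));
    }
    return sb.toString();
  }
}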


Now I am getting output from the reducer like:
word    \root\test\test123, \root\test12

In the next stage I want to remove stop words, scrub words, etc., and also
record the position of each word in the document. How would I apply multiple
maps or multi-level MapReduce jobs programmatically? I guess I need to make
another class or add some functions to it? I am not able to figure it out.
Any pointers for this type of problem?
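
For illustration, one possible way to chain two jobs with this API, as a
minimal sketch only: JobClient.runJob() blocks until a job finishes, so a
second JobConf can read the first job's output directory. FilterMapper and
the intermediate path below are placeholders, not code that exists above.

public int run(String[] args) throws Exception {
  Path input = new Path(args[0]);
  Path intermediate = new Path(args[1] + "-index");  // temp dir between stages
  Path output = new Path(args[1]);

  // Stage 1: build the raw inverted index (mapper/reducer from above).
  JobConf indexJob = new JobConf(getConf(), HadoopProgram.class);
  indexJob.setJobName("inverted-index");
  indexJob.setMapperClass(MapClass.class);
  indexJob.setReducerClass(Reduce.class);
  indexJob.setOutputKeyClass(Text.class);
  indexJob.setOutputValueClass(DocIDs.class);
  indexJob.setMapOutputValueClass(Text.class);
  FileInputFormat.setInputPaths(indexJob, input);
  FileOutputFormat.setOutputPath(indexJob, intermediate);
  JobClient.runJob(indexJob);  // blocks until stage 1 completes

  // Stage 2: read stage 1's output and drop stop words, scrub, etc.
  JobConf filterJob = new JobConf(getConf(), HadoopProgram.class);
  filterJob.setJobName("filter-stopwords");
  filterJob.setMapperClass(FilterMapper.class);  // hypothetical second mapper
  filterJob.setOutputKeyClass(Text.class);
  filterJob.setOutputValueClass(Text.class);
  FileInputFormat.setInputPaths(filterJob, intermediate);
  FileOutputFormat.setOutputPath(filterJob, output);
  JobClient.runJob(filterJob);
  return 0;
}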

Thanks,
Aayush


On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> On Wed, 26 Mar 2008, Aayush Garg wrote:
>
> > HI,
> > I am developing a simple inverted index program with Hadoop. My map
> > function has the output:
> > <word, doc>
> > and the reducer has:
> > <word, list(docs)>
> >
> > Now I want to use one more mapreduce to remove stop and scrub words from
> Use distributed cache as Arun mentioned.
> > this output. Also in the next stage I would like to have a short summary
> Whether to use a separate MR job depends on what exactly you mean by
> summary. If it's like a window around the current word then you can
> possibly do it in one go.
> Amar
> > associated with every word. How should I design my program from this
> > stage?
> > I mean how would I apply multiple mapreduce to this? What would be the
> > better way to perform this?
> >
> > Thanks,
> >
> > Regards,
> > -
> >
> >
>
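
On the distributed-cache suggestion above: the usual pattern with this API is
to register the stop-word file in run() and read it back in the mapper's
configure(). A minimal sketch; the stopwords.txt path is an assumption, and
the fragment needs java.net.URI, java.io.*, java.util.*, and
org.apache.hadoop.filecache.DistributedCache imports.

// In run(), before submitting the job (file must already be in HDFS):
DistributedCache.addCacheFile(new URI("/user/aayush/stopwords.txt"), conf);

// In the mapper:
private Set<String> stopWords = new HashSet<String>();

public void configure(JobConf job) {
  try {
    Path[] cached = DistributedCache.getLocalCacheFiles(job);
    BufferedReader reader =
        new BufferedReader(new FileReader(cached[0].toString()));
    String w;
    while ((w = reader.readLine()) != null) {
      stopWords.add(w.trim());
    }
    reader.close();
  } catch (IOException e) {
    throw new RuntimeException("Failed to load stop-word list", e);
  }
}

// In map(), emit a token only if !stopWords.contains(token).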
