Use Combiner in LinkDb to increase speed of linkdb generation
-------------------------------------------------------------

                 Key: NUTCH-498
                 URL: https://issues.apache.org/jira/browse/NUTCH-498
             Project: Nutch
          Issue Type: Improvement
          Components: linkdb
    Affects Versions: 0.9.0
            Reporter: Espen Amble Kolstad
            Priority: Minor


I tried adding the following combiner to LinkDb:

{code}
   public static class LinkDbCombiner extends MapReduceBase implements Reducer {
      private int _maxInlinks;

      @Override
      public void configure(JobConf job) {
         super.configure(job);
         _maxInlinks = job.getInt("db.max.inlinks", 10000);
      }

      public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
            // Use the first Inlinks value as the accumulator for this key.
            final Inlinks inlinks = (Inlinks) values.next();
            // Number of additional values merged into the accumulator.
            int combined = 0;
            while (values.hasNext()) {
               Inlinks val = (Inlinks) values.next();
               for (Iterator it = val.iterator(); it.hasNext();) {
                  // Stop early once the configured inlink limit is reached.
                  if (inlinks.size() >= _maxInlinks) {
                     output.collect(key, inlinks);
                     return;
                  }
                  Inlink in = (Inlink) it.next();
                  inlinks.add(in);
               }
               combined++;
            }
            // Nothing to emit for a key without inlinks.
            if (inlinks.size() == 0) {
               return;
            }
            if (combined > 0) {
               reporter.incrCounter(Counters.COMBINED, combined);
            }
            output.collect(key, inlinks);
      }
   }
{code}
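
To illustrate why this helps, here is a minimal, Hadoop-free sketch of what map-side combining does: records that share a key are merged into one before they leave the map task, so fewer records have to be sorted and shuffled. The class CombinerSketch, its Record type, and the example URLs are all hypothetical and not part of Nutch. The sketch assumes inlinks behave like a set, and unlike the combiner above it applies the limit to the first record as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the combine step; it uses no Hadoop or Nutch classes.
class CombinerSketch {

    /** One map output record: a target URL plus the source URLs linking to it. */
    static final class Record {
        final String toUrl;
        final Set<String> inlinks;

        Record(String toUrl, String... fromUrls) {
            this.toUrl = toUrl;
            this.inlinks = new LinkedHashSet<>(Arrays.asList(fromUrls));
        }
    }

    /**
     * Merges all records with the same key into a single entry.
     * Once a key holds maxInlinks entries, further inlinks for it are dropped.
     */
    static Map<String, Set<String>> combine(List<Record> mapOutput, int maxInlinks) {
        Map<String, Set<String>> combined = new LinkedHashMap<>();
        for (Record r : mapOutput) {
            Set<String> merged =
                combined.computeIfAbsent(r.toUrl, k -> new LinkedHashSet<>());
            for (String from : r.inlinks) {
                if (merged.size() >= maxInlinks) {
                    break; // limit reached for this key
                }
                merged.add(from);
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Record> mapOutput = new ArrayList<>();
        mapOutput.add(new Record("http://a.example/", "http://x.example/"));
        mapOutput.add(new Record("http://b.example/", "http://y.example/"));
        mapOutput.add(new Record("http://a.example/", "http://z.example/"));
        mapOutput.add(new Record("http://a.example/", "http://w.example/"));
        mapOutput.add(new Record("http://b.example/", "http://y.example/"));

        Map<String, Set<String>> out = combine(mapOutput, 2);
        // prints "5 records in, 2 records out"
        System.out.println(mapOutput.size() + " records in, " + out.size() + " records out");
    }
}
```

In this toy input, five map output records collapse to two, which is the same effect the Combined counter measures on the real job.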

This greatly reduced the time it took to generate a new linkdb. In my case it 
reduced the time by half.


|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|

That's an 87% reduction in the number of output records from the map phase.
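
The figures can be checked with a little arithmetic: subtracting the Combined counter from the map output records gives the number of records that remain, and dividing Combined by the map output records gives the reduction. The sketch below is only a calculator for the counter values quoted above; the class and method names are hypothetical.

```java
// Hypothetical helper that re-derives the numbers from the two job counters.
class CombineStats {

    /** Records left after combining. */
    static long remaining(long mapOutputRecords, long combined) {
        return mapOutputRecords - combined;
    }

    /** Share of map output records eliminated by the combiner, in percent. */
    static double reductionPercent(long mapOutputRecords, long combined) {
        return 100.0 * combined / mapOutputRecords;
    }

    public static void main(String[] args) {
        long mapOutputRecords = 8717810541L;
        long combined = 7632541507L;

        // prints 1085269034
        System.out.println(remaining(mapOutputRecords, combined));
        // prints a value between 87.5 and 87.6
        System.out.println(reductionPercent(mapOutputRecords, combined));
    }
}
```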

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers