[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Espen Amble Kolstad updated NUTCH-498:
--------------------------------------

    Description: 
I tried to add the following combiner to LinkDb:

{code}
   // Added to LinkDb.java; uses the old org.apache.hadoop.mapred API
   // (MapReduceBase, JobConf, OutputCollector, Reporter) that Nutch 0.9 builds on.
   public static enum Counters {COMBINED}

   public static class LinkDbCombiner extends MapReduceBase implements Reducer {
      private int _maxInlinks;

      @Override
      public void configure(JobConf job) {
         super.configure(job);
         // Same db.max.inlinks cap that the LinkDb reducer applies.
         _maxInlinks = job.getInt("db.max.inlinks", 10000);
      }

      public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
         // Merge every map-side Inlinks record for this key into the first one.
         final Inlinks inlinks = (Inlinks) values.next();
         int combined = 0;
         while (values.hasNext()) {
            Inlinks val = (Inlinks) values.next();
            for (Iterator it = val.iterator(); it.hasNext();) {
               if (inlinks.size() >= _maxInlinks) {
                  // Reached the cap: report what was merged so far and emit early.
                  if (combined > 0) {
                     reporter.incrCounter(Counters.COMBINED, combined);
                  }
                  output.collect(key, inlinks);
                  return;
               }
               Inlink in = (Inlink) it.next();
               inlinks.add(in);
            }
            combined++; // one more record folded into the first
         }
         if (inlinks.size() == 0) {
            return; // nothing to emit for this key
         }
         if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
         }
         output.collect(key, inlinks);
      }
   }
{code}
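
Hooking it up is a one-line change in LinkDb's job setup. A minimal sketch, assuming the usual NutchJob-based job construction (the surrounding setup lines are illustrative, not part of this patch):

{code}
// Sketch: registering the combiner when LinkDb builds its job
// (old, pre-0.20 org.apache.hadoop.mapred API, as used by Nutch 0.9).
JobConf job = new NutchJob(config);          // NutchJob is Nutch's JobConf subclass
job.setCombinerClass(LinkDbCombiner.class);  // merge Inlinks map-side, before the shuffle
{code}

Since the combiner's input and output types are both Inlinks, the framework can run it zero or more times per key without changing the final linkdb, apart from where the db.max.inlinks cap truncates early.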

This greatly reduced the time it takes to generate a new linkdb, since far fewer records have to be shuffled and sorted between the map and reduce phases. In my case it cut the time in half.


|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|

That's an 87% reduction of output records from the map phase (7632541507 of 8717810541 records, about 87.5%, were merged away by the combiner).
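
The Combined figure above comes from the Counters.COMBINED counter the combiner increments. For reference, reading it back after the job finishes could look roughly like this (a sketch against the old JobClient API; it assumes the enum is declared in LinkDb as above):

{code}
// Sketch: retrieving the combiner's counter once the job completes.
RunningJob running = JobClient.runJob(job);
org.apache.hadoop.mapred.Counters stats = running.getCounters();
long combined = stats.getCounter(LinkDb.Counters.COMBINED);  // records merged away map-side
{code}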

  was:
I tried to add the following combiner to LinkDb:

{code}
   public static class LinkDbCombiner extends MapReduceBase implements Reducer {
      private int _maxInlinks;

      @Override
      public void configure(JobConf job) {
         super.configure(job);
         _maxInlinks = job.getInt("db.max.inlinks", 10000);
      }

      public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
            final Inlinks inlinks = (Inlinks) values.next();
            int combined = 0;
            while (values.hasNext()) {
               Inlinks val = (Inlinks) values.next();
               for (Iterator it = val.iterator(); it.hasNext();) {
                  if (inlinks.size() >= _maxInlinks) {
                     output.collect(key, inlinks);
                     return;
                  }
                  Inlink in = (Inlink) it.next();
                  inlinks.add(in);
               }
               combined++;
            }
            if (inlinks.size() == 0) {
               return;
            }
            if (combined > 0) {
               reporter.incrCounter(Counters.COMBINED, combined);
            }
            output.collect(key, inlinks);
      }
   }
{code}

This greatly reduced the time it took to generate a new linkdb. In my case it 
reduced the time by half.


|Map output records|8717810541|
|Combined|7632541507|
|Resulting output rec|1085269034|

That's an 87% reduction of output records from the map phase


> Use Combiner in LinkDb to increase speed of linkdb generation
> -------------------------------------------------------------
>
>                 Key: NUTCH-498
>                 URL: https://issues.apache.org/jira/browse/NUTCH-498
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 0.9.0
>            Reporter: Espen Amble Kolstad
>            Priority: Minor
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

