[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------

    Description:
I tried to add the following combiner to LinkDb

  public static enum Counters {COMBINED}

  public static class LinkDbCombiner extends MapReduceBase implements Reducer {

    private int _maxInlinks;

    @Override
    public void configure(JobConf job) {
      super.configure(job);
      _maxInlinks = job.getInt("db.max.inlinks", 10000);
    }

    public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException {
      // The first value seeds the merged result for this key.
      final Inlinks inlinks = (Inlinks) values.next();
      int combined = 0;
      while (values.hasNext()) {
        Inlinks val = (Inlinks) values.next();
        for (Iterator it = val.iterator(); it.hasNext();) {
          // Stop merging and emit once the inlink limit is reached.
          if (inlinks.size() >= _maxInlinks) {
            if (combined > 0) {
              reporter.incrCounter(Counters.COMBINED, combined);
            }
            output.collect(key, inlinks);
            return;
          }
          Inlink in = (Inlink) it.next();
          inlinks.add(in);
        }
        // One more map output record has been folded into the first.
        combined++;
      }
      if (inlinks.size() == 0) {
        return;
      }
      if (combined > 0) {
        reporter.incrCounter(Counters.COMBINED, combined);
      }
      output.collect(key, inlinks);
    }
  }

This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.

Map output records    8717810541
Combined              7632541507
Resulting output rec  1085269034

That's an 87% reduction of output records from the map phase

  was:
I tried to add the following combiner to LinkDb

{code}
  public static class LinkDbCombiner extends MapReduceBase implements Reducer {

    private int _maxInlinks;

    @Override
    public void configure(JobConf job) {
      super.configure(job);
      _maxInlinks = job.getInt("db.max.inlinks", 10000);
    }

    public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException {
      final Inlinks inlinks = (Inlinks) values.next();
      int combined = 0;
      while (values.hasNext()) {
        Inlinks val = (Inlinks) values.next();
        for (Iterator it = val.iterator(); it.hasNext();) {
          if (inlinks.size() >= _maxInlinks) {
            output.collect(key, inlinks);
            return;
          }
          Inlink in = (Inlink) it.next();
          inlinks.add(in);
        }
        combined++;
      }
      if (inlinks.size() == 0) {
        return;
      }
      if (combined > 0) {
        reporter.incrCounter(Counters.COMBINED, combined);
      }
      output.collect(key, inlinks);
    }
  }
{code}

This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.

|Map output records|8717810541|
|Combined|7632541507|
|Resulting output rec|1085269034|

That's an 87% reduction of output records from the map phase


> Use Combiner in LinkDb to increase speed of linkdb generation
> -------------------------------------------------------------
>
>                 Key: NUTCH-498
>                 URL: https://issues.apache.org/jira/browse/NUTCH-498
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 0.9.0
>            Reporter: Espen Amble Kolstad
>            Priority: Minor
>
> I tried to add the following combiner to LinkDb
>
>   public static enum Counters {COMBINED}
>
>   public static class LinkDbCombiner extends MapReduceBase implements Reducer {
>
>     private int _maxInlinks;
>
>     @Override
>     public void configure(JobConf job) {
>       super.configure(job);
>       _maxInlinks = job.getInt("db.max.inlinks", 10000);
>     }
>
>     public void reduce(WritableComparable key, Iterator values,
>         OutputCollector output, Reporter reporter) throws IOException {
>       final Inlinks inlinks = (Inlinks) values.next();
>       int combined = 0;
>       while (values.hasNext()) {
>         Inlinks val = (Inlinks) values.next();
>         for (Iterator it = val.iterator(); it.hasNext();) {
>           if (inlinks.size() >= _maxInlinks) {
>             if (combined > 0) {
>               reporter.incrCounter(Counters.COMBINED, combined);
>             }
>             output.collect(key, inlinks);
>             return;
>           }
>           Inlink in = (Inlink) it.next();
>           inlinks.add(in);
>         }
>         combined++;
>       }
>       if (inlinks.size() == 0) {
>         return;
>       }
>       if (combined > 0) {
>         reporter.incrCounter(Counters.COMBINED, combined);
>       }
>       output.collect(key, inlinks);
>     }
>   }
>
> This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.
>
> Map output records    8717810541
> Combined              7632541507
> Resulting output rec  1085269034
>
> That's an 87% reduction of output records from the map phase

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
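
For readers who want to try the merge logic outside a Hadoop job, here is a minimal sketch of the capped merge that the combiner in the description performs, plus the arithmetic behind the reported counters. It is not the Nutch code: the names CappedMergeSketch, Result, combine and reductionPercent are hypothetical, plain strings stand in for Inlink, and a LinkedHashSet stands in for Inlinks on the assumption that Inlinks behaves like a set. The empty-result check and the Hadoop Reporter/OutputCollector calls are left out.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CappedMergeSketch {

    /** Merged inlinks for one key, plus how many partial lists were fully folded in. */
    public static final class Result {
        public final Set<String> inlinks;
        public final int combined;

        public Result(Set<String> inlinks, int combined) {
            this.inlinks = inlinks;
            this.combined = combined;
        }
    }

    /**
     * Merges the partial inlink lists emitted for one key, stopping as soon as
     * the cap is reached (the role db.max.inlinks plays in the description).
     * Assumes at least one partial list, as a reduce call always has one value.
     */
    public static Result combine(List<List<String>> partials, int maxInlinks) {
        // The first partial list seeds the result, like values.next() in the combiner.
        Set<String> merged = new LinkedHashSet<>(partials.get(0));
        int combined = 0;
        for (List<String> partial : partials.subList(1, partials.size())) {
            for (String inlink : partial) {
                if (merged.size() >= maxInlinks) {
                    return new Result(merged, combined); // cap reached: emit early
                }
                merged.add(inlink);
            }
            combined++; // this partial list no longer needs its own output record
        }
        return new Result(merged, combined);
    }

    /** Share of map output records removed by combining, in percent. */
    public static double reductionPercent(long mapOutputRecords, long combinedRecords) {
        return 100.0 * combinedRecords / mapOutputRecords;
    }

    public static void main(String[] args) {
        List<List<String>> partials = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("b", "c"),
                Arrays.asList("d"));

        Result capped = combine(partials, 3);
        System.out.println(capped.inlinks + " combined=" + capped.combined);

        // Counters quoted in the issue description.
        long mapOut = 8717810541L;
        long combined = 7632541507L;
        System.out.println("remaining records: " + (mapOut - combined));
        System.out.println("reduction: " + reductionPercent(mapOut, combined) + "%");
    }
}
```

With the counters from the description, the remaining record count works out to 1085269034, matching the "Resulting output rec" figure, and the reduction lands between 87% and 88%.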