Use Combiner in LinkDb to increase speed of linkdb generation
-------------------------------------------------------------
                 Key: NUTCH-498
                 URL: https://issues.apache.org/jira/browse/NUTCH-498
             Project: Nutch
          Issue Type: Improvement
          Components: linkdb
    Affects Versions: 0.9.0
            Reporter: Espen Amble Kolstad
            Priority: Minor

I tried to add the following combiner to LinkDb:

{code}
public static class LinkDbCombiner extends MapReduceBase implements Reducer {
  // Intended as a static inner class of org.apache.nutch.crawl.LinkDb, so it
  // reuses the imports already present there (old org.apache.hadoop.mapred
  // API, plus Inlinks/Inlink from Nutch).
  private int _maxInlinks;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    // Cap the number of inlinks kept per URL, using the same limit as the reducer.
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    // Fold all Inlinks values for this key into the first one.
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        if (inlinks.size() >= _maxInlinks) {
          // Limit reached: emit what we have and drop the rest.
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      // Counters is assumed to be an enum defined in LinkDb, e.g.
      // public static enum Counters { COMBINED }
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
{code}

This greatly reduced the time it took to generate a new linkdb. In my case it cut the time roughly in half.

|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|

That's an 87% reduction in output records from the map phase.
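For context, a minimal sketch of where the combiner would plug in. This assumes LinkDb builds its job as a JobConf (via NutchJob, as other Nutch jobs do); the mapper/reducer/output classes shown are illustrative of the pattern rather than quotes from the Nutch source, and the only real change is the setCombinerClass call:

{code}
// Hypothetical job-setup sketch: register the combiner on the JobConf.
JobConf job = new NutchJob(config);
job.setJobName("linkdb " + linkDb);

job.setMapperClass(LinkDb.class);
// The new line: merge Inlinks per key on the map side before the shuffle.
job.setCombinerClass(LinkDbCombiner.class);
job.setReducerClass(LinkDb.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Inlinks.class);
{code}

This works as a combiner because the LinkDb mapper already emits Inlinks values, so the map output type matches the reduce input type, which is the precondition for running the same kind of merge locally on each map task.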