Hello all:
While reading the Nutch source code, I found that the method
processDumpJob(String crawlDb, String output, Configuration config) in
CrawlDbReader.java only sets an input format and an output format, and
never sets a Mapper or Reducer class. Yet it still dumps the existing
crawldb to text format.
Can anyone tell me how this works?
Thanks!
Here is the source code:
-----------------------------------------------
public void processDumpJob(String crawlDb, String output, Configuration config)
    throws IOException {
  if (LOG.isInfoEnabled()) {
    LOG.info("CrawlDb dump: starting");
    LOG.info("CrawlDb db: " + crawlDb);
  }
  Path outFolder = new Path(output);
  JobConf job = new NutchJob(config);
  job.setJobName("dump " + crawlDb);
  job.addInputPath(new Path(crawlDb, CrawlDb.CURRENT_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setOutputPath(outFolder);
  job.setOutputFormat(TextOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);
  JobClient.runJob(job);
  if (LOG.isInfoEnabled()) { LOG.info("CrawlDb dump: done"); }
}
----------------------------------------------------
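My own guess so far, which I have not confirmed against this Nutch
version: in the old org.apache.hadoop.mapred API, a JobConf with no
mapper or reducer set falls back to IdentityMapper and IdentityReducer,
so every (Text, CrawlDatum) record passes through unchanged and
TextOutputFormat writes the key, a tab, and the value's toString().
Below is a plain-Java sketch of that flow with no Hadoop dependency.
The class IdentityDumpSketch, its methods, and the sample values are
hypothetical stand-ins, not Nutch or Hadoop code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of a map/reduce job that sets no Mapper or Reducer.
public class IdentityDumpSketch {

    // Stand-in for an identity mapper: emits the input record unchanged.
    static Map.Entry<String, String> identityMap(String key, String value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    // Stand-in for a text output format: key, a tab, then the value as a string.
    static String formatLine(String key, String value) {
        return key + "\t" + value;
    }

    // Runs the simulated job over an in-memory "crawldb" and returns the output lines.
    static List<String> dump(Map<String, String> crawlDb) {
        // Map phase plus shuffle: records are grouped and sorted by key.
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> record : crawlDb.entrySet()) {
            Map.Entry<String, String> out = identityMap(record.getKey(), record.getValue());
            grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
        }
        // Identity reduce: every value is emitted under its key, unchanged.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            for (String value : group.getValue()) {
                lines.add(formatLine(group.getKey(), value));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up records; a real CrawlDatum prints more fields than this.
        Map<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://b.example/", "datum-b");
        crawlDb.put("http://a.example/", "datum-a");
        for (String line : dump(crawlDb)) {
            System.out.println(line);
        }
    }
}
```

If this guess is right, the text in the dump would come entirely from
CrawlDatum.toString(), and the job itself would do no processing.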
--
Best Regards,
Yours
Phonechen