Hi,
JobConf falls back to default values when no mapper or reducer is set:
IdentityMapper and IdentityReducer. As their names imply, these classes
do not alter the data; they pass each record through intact. The dump
job does not need to change the data, only to convert it from the
(binary) SequenceFile input format to the text output format.
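
To make that concrete, here is a rough, self-contained sketch of what the
job effectively does. It is plain Java, not the Hadoop API: the class
IdentityDumpSketch and its dump() method are made up for illustration,
strings stand in for the Text keys and CrawlDatum values, and the sample
URLs are invented. It assumes the text output writes one line per record
as the key, a tab, and the value's toString().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not Hadoop or Nutch code.
public class IdentityDumpSketch {

    /** Simulates identity map, sort by key, identity reduce, text output. */
    public static List<String> dump(List<String[]> records) {
        // Map phase: an identity mapper emits each (key, value) unchanged.
        // Shuffle: the framework sorts and groups the pairs by key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] rec : records) {
            grouped.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
        }

        // Reduce phase: an identity reducer emits every value unchanged.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            for (String value : e.getValue()) {
                // Text output: key, a tab, then the value as a string.
                lines.add(e.getKey() + "\t" + value);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Invented sample records standing in for crawldb entries.
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"http://b.example/", "status=fetched"});
        records.add(new String[] {"http://a.example/", "status=unfetched"});
        for (String line : dump(records)) {
            System.out.println(line);
        }
    }
}
```

The records come out sorted by key but otherwise untouched, so the only
real work is done by the input and output formats.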
phonechen wrote:
Hello all:
When I read the Nutch source code, I found that processDumpJob(String
crawlDb, String output, Configuration config) in CrawlDbReader.java
only sets an InputFormat and an OutputFormat, with no Mapper or Reducer
class. Yet it can dump the existing crawldb to a text format.
Can anyone tell me how it works?
Thanks!
Here is the source code:
-----------------------------------------------
public void processDumpJob(String crawlDb, String output,
    Configuration config) throws IOException {
  if (LOG.isInfoEnabled()) {
    LOG.info("CrawlDb dump: starting");
    LOG.info("CrawlDb db: " + crawlDb);
  }

  Path outFolder = new Path(output);

  JobConf job = new NutchJob(config);
  job.setJobName("dump " + crawlDb);

  job.addInputPath(new Path(crawlDb, CrawlDb.CURRENT_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);

  job.setOutputPath(outFolder);
  job.setOutputFormat(TextOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);

  JobClient.runJob(job);
  if (LOG.isInfoEnabled()) { LOG.info("CrawlDb dump: done"); }
}
----------------------------------------------------