Hi,

JobConf has default values for the mapper and reducer classes: IdentityMapper and IdentityReducer. These classes, as their names imply, do not alter the data but pass it through intact. The dump job does not need to alter the data, only to transform it from a (binary) SequenceFile (the InputFormat) to text (the OutputFormat).
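To illustrate the pass-through behavior, here is a minimal plain-Java sketch of what an identity mapper effectively does. This is not the actual Hadoop API (the class `IdentitySketch` and its `map` method are illustrative only, and there is no Hadoop dependency); it just shows that each (key, value) pair is emitted unchanged, so the job's output is the input re-encoded by the configured OutputFormat:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Illustrative stand-in for Hadoop's IdentityMapper:
// emit each (key, value) pair unchanged.
public class IdentitySketch {
    static <K, V> Map.Entry<K, V> map(K key, V value) {
        // Pass the pair through intact -- no transformation of the data itself.
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> out = map("http://example.com/", "db_fetched");
        // Only the on-disk encoding changes (SequenceFile in, text out),
        // which is handled by the InputFormat/OutputFormat, not the mapper.
        System.out.println(out.getKey() + "\t" + out.getValue());
    }
}
```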

phonechen wrote:
Hello all:
While reading the Nutch source code I found that processDumpJob(String
crawlDb, String output, Configuration config) in CrawlDbReader.java
only sets an inputFormat and outputFormat, without any Mapper or Reducer
class. But it can still dump the existing crawldb to a text format.
Can anyone tell me how this works?
Thanks!

Here is the source code:
-----------------------------------------------
public void processDumpJob(String crawlDb, String output, Configuration config)
    throws IOException {

  if (LOG.isInfoEnabled()) {
    LOG.info("CrawlDb dump: starting");
    LOG.info("CrawlDb db: " + crawlDb);
  }

  Path outFolder = new Path(output);

  JobConf job = new NutchJob(config);
  job.setJobName("dump " + crawlDb);

  job.addInputPath(new Path(crawlDb, CrawlDb.CURRENT_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);

  job.setOutputPath(outFolder);
  job.setOutputFormat(TextOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);

  JobClient.runJob(job);
  if (LOG.isInfoEnabled()) { LOG.info("CrawlDb dump: done"); }
}
----------------------------------------------------
