Hi,
I have Nutch 2.x set up with MySQL and am seeing a peculiar
NullPointerException during a crawl with sample seeds from DMOZ. I decided to
do a fresh crawl with only one URL as the seed and an empty webpage table.
I am running *org.apache.nutch.crawl.Crawler* from Eclipse with the args *urls
-dir /home/binoy/lab/dmoz/apache-url -solr http://localhost:8983/solr/
-depth 1 -topN 1*.
The apache-url seed file has only one entry ("http://nutch.apache.org/").
I see the following NullPointerException. Logs:
http://pastebin.com/CaqJpPkn
With a little debugging from Eclipse, I see that the line
conf.set(GeneratorJob.BATCH_ID, batchId);
in the createIndexJob method of IndexerJob.java is the root cause: batchId
is null when it is called.
Wrapping it in *if (batchId != null)* seems to solve the issue.
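A minimal standalone sketch of the failure and the guard follows. It does not use Nutch or Hadoop: java.util.Properties stands in for the Hadoop Configuration, on the assumption that Configuration.set rejects a null value the same way Properties.setProperty does. The class name, method names and the key "example.batch.id" are hypothetical and exist only for this illustration.

```java
import java.util.Properties;

// Illustration only: Properties is a stand-in for the Hadoop Configuration (assumption).
class BatchIdGuardSketch {

    // Hypothetical key; in Nutch the real constant is GeneratorJob.BATCH_ID.
    static final String BATCH_ID_KEY = "example.batch.id";

    // Mirrors the original line: fails when batchId is null,
    // because Properties does not accept null values.
    static void setUnguarded(Properties conf, String batchId) {
        conf.setProperty(BATCH_ID_KEY, batchId);
    }

    // Mirrors the proposed patch: the key is simply left unset for a null batchId.
    static void setGuarded(Properties conf, String batchId) {
        if (batchId != null) {
            conf.setProperty(BATCH_ID_KEY, batchId);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        try {
            setUnguarded(conf, null);
        } catch (NullPointerException e) {
            System.out.println("unguarded set threw NullPointerException");
        }

        setGuarded(conf, null);
        System.out.println("key set after guarded null: "
                + (conf.getProperty(BATCH_ID_KEY) != null));

        setGuarded(conf, "batch-1");
        System.out.println("key after guarded non-null: "
                + conf.getProperty(BATCH_ID_KEY));
    }
}
```

With the guard in place, any code that later reads the key has to cope with it being absent, which is why it matters who else reads GeneratorJob.BATCH_ID.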
I wanted to know if this is a valid patch. From grepping the source, it seems
no one except IndexerJob reads GeneratorJob.BATCH_ID.
I always see batchId passed as null to createIndexJob for clean
crawls (empty table). Which scenario causes it to be non-null, and what is
the significance of the generator job's batchId for the indexing job?
It seems a trivial issue, so I did not create a JIRA issue. I have attached
the small patch and would be glad if someone could take a look.
Regards,
Binoy
Index: IndexerJob.java
===================================================================
--- IndexerJob.java (revision 1454771)
+++ IndexerJob.java (working copy)
@@ -125,7 +125,11 @@
   protected Job createIndexJob(Configuration conf, String jobName, String batchId)
   throws IOException, ClassNotFoundException {
-    conf.set(GeneratorJob.BATCH_ID, batchId);
+
+    if (batchId != null) {
+      conf.set(GeneratorJob.BATCH_ID, batchId);
+    }
+
     Job job = new NutchJob(conf, jobName);
     // TODO: Figure out why this needs to be here
     job.getConfiguration().setClass("mapred.output.key.comparator.class",