I am on trunk, trying to run the following:

bin/nutch crawl urls -dir crawl.test

where urls contains

http://spack.net/

and conf/crawl-urlfilter.txt contains

+^http://spack.net/

When I run the command I get:

Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:148)
        at org.apache.nutch.indexer.IndexSegment.main(IndexSegment.java:262)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:153)

I did a little digging, and it appears that the "lang" field
ends up being null (I couldn't quite track down where it
should have been set).  I'm not sure it is a proper fix, but
changing doc.getField("lang").stringValue() to
doc.get("lang") makes my little crawl complete, as does
commenting out that LOG.info call.  Like I say, I'm not sure
why lang is null, but if it can be null, we probably
shouldn't be calling stringValue() on it.  I guess it didn't
like http://spack.net/
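
To illustrate the difference between the two calls, here is a minimal,
self-contained sketch. MiniDoc and MiniField are hypothetical stand-ins
written for this example, not the real Lucene classes; they only assume
that getField() returns null for a missing field and that get() returns
null instead of throwing, which is how I understand the Lucene Document
API to behave.

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeFieldSketch {

    /** Hypothetical stand-in for a stored field (not the Lucene class). */
    static final class MiniField {
        private final String value;
        MiniField(String value) { this.value = value; }
        String stringValue() { return value; }
    }

    /** Hypothetical stand-in for a document holding named fields. */
    static final class MiniDoc {
        private final Map<String, MiniField> fields = new HashMap<>();

        void add(String name, String value) {
            fields.put(name, new MiniField(value));
        }

        /** Returns the field, or null when no field has that name. */
        MiniField getField(String name) {
            return fields.get(name);
        }

        /** Returns the field's string value, or null when it is missing. */
        String get(String name) {
            MiniField f = fields.get(name);
            return f == null ? null : f.stringValue();
        }
    }

    /** Builds the log message using only the null-tolerant accessor. */
    static String describe(MiniDoc doc) {
        return " Indexing [" + doc.get("url") + "] with analyzer ("
                + doc.get("lang") + ")";
    }

    public static void main(String[] args) {
        MiniDoc doc = new MiniDoc();
        doc.add("url", "http://spack.net/");  // note: no "lang" field is added

        // Safe: a missing field is simply rendered as "null" in the message.
        System.out.println(describe(doc));

        // Unsafe: getField() returns null, so stringValue() throws.
        try {
            doc.getField("lang").stringValue();
        } catch (NullPointerException e) {
            System.out.println("getField(\"lang\") returned null");
        }
    }
}
```

This only shows why the log line stops throwing; it does not explain why
lang was never set for this page in the first place.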

Patch follows.

Earl

Index: trunk/src/java/org/apache/nutch/indexer/IndexSegment.java
===================================================================
--- trunk/src/java/org/apache/nutch/indexer/IndexSegment.java  (revision 264952)
+++ trunk/src/java/org/apache/nutch/indexer/IndexSegment.java  (working copy)
@@ -146,7 +146,7 @@
               // add the document to the index
               NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
               LOG.info(" Indexing [" + doc.getField("url").stringValue() +
-                       "] with analyzer " + analyzer + " (" + doc.getField("lang").stringValue() + ")");
+                       "] with analyzer " + analyzer + " (" + doc.get("lang") + ")");
               //LOG.info(" Doc is " + doc);
               writer.addDocument(doc, analyzer);
               if (count > 0 && count % LOG_STEP == 0) {

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers