Two observations using the Nutch 1.1 nightly build
nutch-2010-04-14_04-00-47:

1) I was using Nutch 1.0 to crawl successfully, but had problems with
parse-pdf. I decided to try Nutch 1.1 with parse-tika, which appears to
parse all of the 'problem' PDFs that parse-pdf could not handle. The
crawldb and segments directories are created and appear to be valid.
However, instead of finishing, the overall crawl now dies with a
NullPointerException once the generator runs out of URLs:

nutch crawl urls/urls -dir crawl -depth 10
...
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100415015102]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)

Nutch 1.0 would complete like this:

nutch crawl urls/urls -dir crawl -depth 10
...
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=7 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-00000
done merging
crawl finished: crawl

Any ideas?


2) If there is a space in any directory component of the install path,
$NUTCH_OPTS is split incorrectly and the java launcher misparses the
options, causing this problem:

m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> nutch
crawl urls/urls -dir crawl -depth 10 -topN 10
NUTCH_OPTS:  -Dhadoop.log.dir=/home/mag/Desktop/untitled
folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log
-Djava.library.path=/home/mag/Desktop/untitled
folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
Exception in thread "main" java.lang.NoClassDefFoundError:
folder/nutch-2010-04-14_04-00-47/logs
Caused by: java.lang.ClassNotFoundException:
folder.nutch-2010-04-14_04-00-47.logs
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs.
Program will exit.
m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> 

Obviously the workaround is to rename 'untitled folder' to
'untitledFolderWithNoSpaces'.
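The root cause is presumably unquoted variable expansion in the launcher
script: when $NUTCH_OPTS is expanded without quotes on the java command
line, the shell field-splits the -Dhadoop.log.dir value at the space, so
the trailing half ('folder/nutch-.../logs') lands as a separate argument,
which java then treats as the main class name. A minimal sketch of the
splitting behavior (the path and variable values here are illustrative,
not taken from the actual bin/nutch script):

```shell
#!/bin/sh
# Illustrative path containing a space, like 'untitled folder' above.
NUTCH_HOME="/home/mag/Desktop/untitled folder/nutch"
NUTCH_OPTS="-Dhadoop.log.dir=$NUTCH_HOME/logs"

# Unquoted expansion: the shell splits at the space, yielding TWO words,
# so 'folder/nutch/logs' becomes a stray argument (the bogus "main class").
set -- $NUTCH_OPTS
echo "unquoted arg count: $#"   # prints 2

# Quoted expansion: the option survives as ONE argument.
set -- "$NUTCH_OPTS"
echo "quoted arg count: $#"     # prints 1
```

So the script-side fix would be to quote the expansions ("$NUTCH_OPTS",
"$NUTCH_HOME", etc.) wherever they are passed to java, rather than
requiring a space-free install path.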

Thanks; any help would be appreciated with issue #1 above.

-m.
