Two observations using the Nutch 1.1 nightly build nutch-2010-04-14_04-00-47:
1) I was using Nutch 1.0 to crawl successfully, but had problems with parse-pdf. I decided to try Nutch 1.1 with parse-tika, which appears to parse all of the 'problem' PDFs that parse-pdf could not handle. The crawldb and segments directories are created and appear to be valid. However, the overall crawl no longer finishes:

nutch crawl urls/urls -dir crawl -depth 10
...
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100415015102]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Exception in thread "main" java.lang.NullPointerException
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)

Nutch 1.0 would complete like this:

nutch crawl urls/urls -dir crawl -depth 10
...
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=7 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-00000
done merging
crawl finished: crawl

Any ideas?
2) If there is a space in any directory name on the path to the Nutch install, then $NUTCH_OPTS is invalid and causes this problem:

m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> nutch crawl urls/urls -dir crawl -depth 10 -topN 10
NUTCH_OPTS: -Dhadoop.log.dir=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log -Djava.library.path=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
Exception in thread "main" java.lang.NoClassDefFoundError: folder/nutch-2010-04-14_04-00-47/logs
Caused by: java.lang.ClassNotFoundException: folder.nutch-2010-04-14_04-00-47.logs
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs. Program will exit.
m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin>

Obviously the workaround is to rename 'untitled folder' to 'untitledFolderWithNoSpaces'.

Thanks, any help would be appreciated with issue #1 above.

-m.
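
The error in issue 2 is consistent with $NUTCH_OPTS being expanded without quotes, so that the shell splits the log directory at the space and java takes the second half (folder/nutch-2010-04-14_04-00-47/logs) as the main class name. That has not been checked against the actual bin/nutch script; the sketch below only illustrates the word-splitting behaviour. The count_args function and the shortened install path are hypothetical stand-ins, not part of Nutch.

```shell
#!/bin/sh
# Hypothetical stand-in for the java launcher: prints how many
# separate arguments it received.
count_args() {
  echo "$#"
}

# Hypothetical install directory with a space in it, as in the report.
NUTCH_HOME="/home/mag/Desktop/untitled folder/nutch"

# One JVM option kept in a plain string variable (an assumption about
# how a launcher script might build its options).
NUTCH_OPTS="-Dhadoop.log.dir=$NUTCH_HOME/logs"

# Unquoted expansion: the shell splits the value at the space, giving
# "-Dhadoop.log.dir=/home/mag/Desktop/untitled" and "folder/nutch/logs".
unquoted=$(count_args $NUTCH_OPTS)

# Quoted expansion: the option stays a single argument.
quoted=$(count_args "$NUTCH_OPTS")

# With several options, one quoted string would merge them all into one
# argument, so build them as separate quoted words instead.
set -- "-Dhadoop.log.dir=$NUTCH_HOME/logs" "-Dhadoop.log.file=hadoop.log"
multi=$(count_args "$@")

echo "unquoted: $unquoted argument(s)"
echo "quoted: $quoted argument(s)"
echo "separate words: $multi argument(s)"
```

If the script does build its options this way, quoting each option separately (or using "$@" as above) would let the install live under a path with spaces without renaming the directory.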