Two observations using the Nutch 1.1 nightly build

1) Previously I was using Nutch 1.0 to crawl successfully, but had
problems with parse-pdf. I decided to try Nutch 1.1 with parse-tika,
which appears to parse all of the 'problem' PDFs that parse-pdf could
not handle. The crawldb and segments directories are created and appear
to be valid. However, the overall crawl no longer finishes:

nutch crawl urls/urls -dir crawl -depth 10
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100415015102]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.crawl.Crawl.main(
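Until that NPE in the one-shot `nutch crawl` command is sorted out, a workaround is to drive the crawl cycle step by step with the standard sub-commands (inject, generate, fetch, updatedb, invertlinks) and stop cleanly when the generator selects 0 records. A sketch along these lines, assuming a local, un-distributed setup; the segment-counting stop logic is my own, not part of Nutch:

```shell
#!/bin/sh
# Sketch: run the crawl cycle manually so we can stop cleanly when
# "generate" selects 0 records, instead of hitting the NPE in the
# one-shot "nutch crawl" command. The sub-commands are standard;
# the segment-counting check is an assumption/workaround of mine.
crawl_loop() {
  nutch=$1       # path to the nutch launcher script
  crawl=$2       # crawl directory
  depth=$3       # maximum number of fetch rounds
  depth_done=0

  "$nutch" inject "$crawl/crawldb" urls/urls

  while [ "$depth_done" -lt "$depth" ]; do
    before=$(ls "$crawl/segments" 2>/dev/null | wc -l)
    "$nutch" generate "$crawl/crawldb" "$crawl/segments"
    after=$(ls "$crawl/segments" 2>/dev/null | wc -l)
    if [ "$after" -eq "$before" ]; then
      echo "Stopping at depth=$depth_done - no more URLs to fetch."
      break
    fi
    # the newest segment is the one generate just created
    seg=$(ls -d "$crawl/segments/"* | sort | tail -n 1)
    "$nutch" fetch "$seg"
    "$nutch" updatedb "$crawl/crawldb" "$seg"
    depth_done=$((depth_done + 1))
  done

  "$nutch" invertlinks "$crawl/linkdb" -dir "$crawl/segments"
}

# run for real only when the launcher is actually present
if [ -x bin/nutch ]; then
  crawl_loop bin/nutch crawl 10
fi
```

This mirrors what `nutch crawl` does internally, but the exit path when nothing is generated is under our control.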

Nutch 1.0 would complete like this:

nutch crawl urls/urls -dir crawl -depth 10
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=7 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
LinkDb: adding segment:
LinkDb: adding segment:
LinkDb: adding segment:
LinkDb: adding segment:
LinkDb: adding segment:
LinkDb: adding segment:
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
done merging
crawl finished: crawl

Any ideas?

2) If there is a space in any component of the installation path, then
$NUTCH_OPTS is split incorrectly and causes this problem:

m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> nutch
crawl urls/urls -dir crawl -depth 10 -topN 10
NUTCH_OPTS:  -Dhadoop.log.dir=/home/mag/Desktop/untitled
folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log
Exception in thread "main" java.lang.NoClassDefFoundError:
Caused by: java.lang.ClassNotFoundException:
        at Method)
        at java.lang.ClassLoader.loadClass(
        at sun.misc.Launcher$AppClassLoader.loadClass(
        at java.lang.ClassLoader.loadClass(
        at java.lang.ClassLoader.loadClassInternal(
Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs.
Program will exit.
m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> 

Obviously the workaround is to rename 'untitled folder' to something
without a space.
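The underlying issue looks like shell word-splitting: when the launcher expands $NUTCH_OPTS unquoted, the space inside the hadoop.log.dir value splits one -D option into two arguments, and the stray "folder/.../logs" piece ends up being taken for the main class (matching the error above). A standalone sketch of the failure mode and one possible fix; the variable names only echo bin/nutch, this is not a patch against it:

```shell
#!/bin/bash
# Demonstrate how an unquoted path containing a space splits one
# -D option into two arguments (the bug), and how keeping each
# option as a single quoted word (here via a bash array) avoids it.
# Standalone sketch - names mirror bin/nutch but this is not a patch.
NUTCH_LOG_DIR="/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/logs"

count_args() { echo $#; }

# Broken: unquoted expansion word-splits on the space, so two -D
# options become three arguments.
NUTCH_OPTS="-Dhadoop.log.dir=$NUTCH_LOG_DIR -Dhadoop.log.file=hadoop.log"
echo "unquoted: $(count_args $NUTCH_OPTS) args"   # prints 3, not 2

# Fixed: build the options as individually quoted words and expand
# them as "${NUTCH_OPTS_ARR[@]}" on the java command line.
NUTCH_OPTS_ARR=("-Dhadoop.log.dir=$NUTCH_LOG_DIR"
                "-Dhadoop.log.file=hadoop.log")
echo "quoted:   $(count_args "${NUTCH_OPTS_ARR[@]}") args"   # prints 2
```

So a fix in the launcher script itself (quoting each option, or using an array) would make the rename unnecessary.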

Thanks, any help would be appreciated with issue #1 above.

