Not able to crawl local file system: need help

Garnier Garnier Wed, 27 Feb 2008 22:41:08 -0800

Nutch experts: 

Here’s the problem:


1.      Downloaded Nutch 0.9 from the site.
2.      Modified the required files to crawl on Linux.
3.      The HTTP crawl was successful and an index was created.
4.      Modified the files to run a local filesystem crawl.
5.      Googled and found the following links:
        http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
        http://www.folge2.de/tp/search/a/crawling-the-local-filesystem-with-nutch
6.      Modified the files as described in those links.
7.      The crawl fails with the error shown at the end of this message.

The config file format seems to be fine, but I am not able to debug the error.
I went through changes.txt, which mentions that the following has been fixed:

53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
    framework to operate properly (Heiko Dietze via mattmann)


I am not sure why the local crawl fails. Could the experts please help?

Regards,
Garnier

Error:
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 1000
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080228114148
Generator: filtering: false
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080228114148
Fetcher: threads: 10
fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080228114148]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080228114155
Generator: filtering: false
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080228114155
Fetcher: threads: 10
fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080228114155]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080228114201
Generator: filtering: false
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080228114201
Fetcher: threads: 10
fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080228114201]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080228114148
LinkDb: adding segment: crawl/segments/20080228114155
LinkDb: adding segment: crawl/segments/20080228114201
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080228114148
Indexer: adding segment: crawl/segments/20080228114155
Indexer: adding segment: crawl/segments/20080228114201
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
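
A note on the ProtocolNotFound lines in the log above: in Nutch, protocol
handlers are plugins, and a plugin is only loaded if its id matches the regular
expression in the plugin.includes property (usually overridden in
conf/nutch-site.xml). One way to check this is sketched below in Python. The
sample XML value is an assumption for illustration only, since the actual
config was not posted, and the helper names get_property and plugin_enabled are
hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml content (an assumption; the real file was not posted).
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
"""


def get_property(xml_text, name):
    """Return the <value> text of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def plugin_enabled(xml_text, plugin_id):
    """True if the plugin.includes regex matches the whole plugin id."""
    pattern = get_property(xml_text, "plugin.includes")
    if pattern is None:
        return False
    return re.fullmatch(pattern.strip(), plugin_id) is not None


if __name__ == "__main__":
    for plugin in ("protocol-http", "protocol-file"):
        print(plugin, plugin_enabled(SAMPLE_SITE_XML, plugin))
```

If the check reports False for protocol-file against the real config, adding
protocol-file to the plugin.includes value (for example protocol-(file|http))
would be the first thing to try. That is a guess based on the log, not a
confirmed diagnosis.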




