Nutch experts: Here’s the problem:
1. Downloaded Nutch 0.9 from the site.
2. Modified the required files to crawl on Linux.
3. The http crawl was successful and an index was created.
4. Modified the files to run a local filesystem crawl.
5. Googled and found the following links:
   http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
   http://www.folge2.de/tp/search/a/crawling-the-local-filesystem-with-nutch
6. Modified the files as mentioned there.
7. The crawl fails with the error below.

The config file format seems to be fine, but I am not able to debug the error. I went through changes.txt, and it mentions that the following has been fixed:

53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins framework to operate properly (Heiko Dietze via mattmann)

I am not sure why the local crawl fails. May I request the experts for help?

Regards,
Garnier

Error:

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 1000
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080228114148
Generator: filtering: false
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080228114148
Fetcher: threads: 10
fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080228114148]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080228114155
Generator: filtering: false
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080228114155
Fetcher: threads: 10
fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080228114155]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080228114201
Generator: filtering: false
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080228114201
Fetcher: threads: 10
fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080228114201]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080228114148
LinkDb: adding segment: crawl/segments/20080228114155
LinkDb: adding segment: crawl/segments/20080228114201
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080228114148
Indexer: adding segment: crawl/segments/20080228114155
Indexer: adding segment: crawl/segments/20080228114201
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
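
One thing that may be worth checking, since the fetcher reports ProtocolNotFound for the file: URL, is whether the plugin.includes value in the Nutch config actually admits the protocol-file plugin. The sketch below is only an illustration in Python and is not part of Nutch: the helper names and the sample config are hypothetical, and it assumes plugin.includes is a regular expression matched against the whole plugin id.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional


def plugin_includes(config_xml: str) -> Optional[str]:
    """Return the plugin.includes value from a Nutch-style config, or None."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    return None


def enables_plugin(includes_regex: str, plugin_id: str) -> bool:
    """Assumption: a plugin is enabled when the regex matches its full id."""
    return re.fullmatch(includes_regex, plugin_id) is not None


# Hypothetical sample config, shortened for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    includes = plugin_includes(SAMPLE)
    print("plugin.includes:", includes)
    print("protocol-file enabled:", enables_plugin(includes, "protocol-file"))
```

If the check comes back false for the real config file, the fetcher would have no plugin registered for file: URLs, which would be consistent with the error in the log above.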
