Re: Not able to crawl local file system: need help

Ismael Thu, 28 Feb 2008 03:38:32 -0800

Hello. I think your problem might be the same one described (and answered) here:

http://www.nabble.com/Exception-in-DeleteDuplicates.java-td14781941.html


A piece of advice: when you hit an error, try googling just the error message.
Answers are easy to find that way if somebody else has already had the same
problem.
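
A minimal sketch of that idea, assuming the crawl output has been saved as
text: it pulls out only the lines that look like Java exception messages so
they can be pasted into a search engine. The names extract_errors, ERROR_RE
and SAMPLE are hypothetical and not part of Nutch, and the regular expression
is only a heuristic for "fully qualified class name, colon, message". The
sample lines are copied from the log quoted below.

```python
import re

# Heuristic (an assumption, not a Nutch facility): a lowercase package path,
# a capitalized class name, then ": message".
ERROR_RE = re.compile(r"\b((?:[a-z][a-z0-9_]*\.)+[A-Z]\w*): (.+)")

# A few lines taken from the quoted crawl log, including the "> " quote markers.
SAMPLE = """\
> fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
> url=file
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
"""


def extract_errors(log_text):
    """Return the distinct 'ExceptionClass: message' strings, in order of appearance."""
    found = []
    for raw in log_text.splitlines():
        # Drop email quote markers and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        match = ERROR_RE.search(line)
        if match:
            query = f"{match.group(1)}: {match.group(2).strip()}"
            if query not in found:
                found.append(query)
    return found


if __name__ == "__main__":
    for query in extract_errors(SAMPLE):
        print(query)
```

Each string it returns is a candidate search query; stack-trace "at ..." lines
and ordinary progress lines are skipped.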

Good luck!


2008/2/28, Garnier Garnier <[EMAIL PROTECTED]>:
>
>
> Nutch experts:
>
> Here's the problem:
>
> 1.      Downloaded Nutch 0.9 from the site.
> 2.      Modified the required files to crawl on Linux.
> 3.      The HTTP crawl was successful and an index was created.
> 4.      Modified the files to run a local filesystem crawl.
> 5.      Googled and found the following links:
>         http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
>         http://www.folge2.de/tp/search/a/crawling-the-local-filesystem-with-nutch
> 6.      Modified the files as mentioned.
> 7.      Crawl fails with the following error.
> The config file format seems to be fine, but I am not able to debug the
> error. I went through changes.txt and it mentions that the following has
> been fixed:
> 53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
>     framework to operate properly (Heiko Dietze via mattmann)
>
>
> I am not sure why the local crawl fails. Could the experts please help?
>
> Regards,
> Garnier
>
> Error:
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 1000
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080228114148
> Generator: filtering: false
> Generator: topN: 1000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080228114148
> Fetcher: threads: 10
> fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
> fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080228114148]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080228114155
> Generator: filtering: false
> Generator: topN: 1000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080228114155
> Fetcher: threads: 10
> fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
> fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080228114155]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080228114201
> Generator: filtering: false
> Generator: topN: 1000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080228114201
> Fetcher: threads: 10
> fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
> fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080228114201]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080228114148
> LinkDb: adding segment: crawl/segments/20080228114155
> LinkDb: adding segment: crawl/segments/20080228114201
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080228114148
> Indexer: adding segment: crawl/segments/20080228114155
> Indexer: adding segment: crawl/segments/20080228114201
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
