On Wed, Jan 20, 2010 at 7:10 PM, kraman <kirthi.ra...@gmail.com> wrote:
>
> kirth...@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
> crawl started in: tinycrawl
> rootUrlDir = url
> threads = 10
> depth = 2
> Injector: starting
> Injector: crawlDb: tinycrawl/crawldb
> Injector: urlDir: url
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: tinycrawl/segments/20100120130316
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: tinycrawl/segments/20100120130316
> Fetcher: threads: 10
> fetching http://www.mywebsite.us/
> fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
You need to fix the Nutch config file as per the README.

> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: tinycrawl/crawldb
> CrawlDb update: segments: [tinycrawl/segments/20100120130316]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: tinycrawl/segments/20100120130323
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: tinycrawl/segments/20100120130323
> Fetcher: threads: 10
> fetching http://www.mywebsite.us/
> fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: tinycrawl/crawldb
> CrawlDb update: segments: [tinycrawl/segments/20100120130323]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: tinycrawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: tinycrawl/segments/20100120130323
> LinkDb: adding segment: tinycrawl/segments/20100120130316
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: tinycrawl/linkdb
> Indexer: adding segment: tinycrawl/segments/20100120130323
> Indexer: adding segment: tinycrawl/segments/20100120130316
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: tinycrawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
> LogFile gives
> java.lang.ArrayIndexOutOfBoundsException: -1
>         at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
>         at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
>         at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> --
> View this message in context:
> http://old.nabble.com/Tried-to-run-Crawl-with-depth-of-only-2-and-getting-IOException-tp27246959p27246959.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
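For reference, the "Agent name not configured!" failure means the `http.agent.name` property is empty. A minimal sketch of the override to put in conf/nutch-site.xml — the agent name value here is just a placeholder, use your own crawler's name:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: properties here override conf/nutch-default.xml -->
<configuration>
  <property>
    <!-- The fetcher refuses to run while this is empty;
         "MyTestCrawler" is only an example value. -->
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>
```

Once fetches succeed, the later Dedup `ArrayIndexOutOfBoundsException` will likely disappear as well: with every fetch failing, the index being deduplicated is empty, and the exception looks like a downstream symptom of that rather than a separate bug.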