Yes, the agent name was empty. It works now.

Thanks much.
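For the archives: the "Agent name not configured!" RuntimeException comes from
the fetcher when the http.agent.name property is empty. Setting it in
conf/nutch-site.xml fixed it for me; something along these lines (the value is
just a placeholder name for your crawler, pick your own):

  <!-- in conf/nutch-site.xml; the value below is only an example name -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>

Once the fetch succeeded, the Dedup IOException went away as well, so it looks
like that was just a downstream symptom of the empty indexes left behind by the
failed fetches.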


Nutch Newbie wrote:
> 
> On Wed, Jan 20, 2010 at 7:10 PM, kraman <kirthi.ra...@gmail.com> wrote:
>>
>> kirth...@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
>> crawl started in: tinycrawl
>> rootUrlDir = url
>> threads = 10
>> depth = 2
>> Injector: starting
>> Injector: crawlDb: tinycrawl/crawldb
>> Injector: urlDir: url
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: done
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: tinycrawl/segments/20100120130316
>> Generator: filtering: false
>> Generator: topN: 2147483647
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: tinycrawl/segments/20100120130316
>> Fetcher: threads: 10
>> fetching http://www.mywebsite.us/
>> fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
> 
> You need to fix the Nutch config file as per the README.
> 
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: tinycrawl/crawldb
>> CrawlDb update: segments: [tinycrawl/segments/20100120130316]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: tinycrawl/segments/20100120130323
>> Generator: filtering: false
>> Generator: topN: 2147483647
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: tinycrawl/segments/20100120130323
>> Fetcher: threads: 10
>> fetching http://www.mywebsite.us/
>> fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: tinycrawl/crawldb
>> CrawlDb update: segments: [tinycrawl/segments/20100120130323]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> LinkDb: starting
>> LinkDb: linkdb: tinycrawl/linkdb
>> LinkDb: URL normalize: true
>> LinkDb: URL filter: true
>> LinkDb: adding segment: tinycrawl/segments/20100120130323
>> LinkDb: adding segment: tinycrawl/segments/20100120130316
>> LinkDb: done
>> Indexer: starting
>> Indexer: linkdb: tinycrawl/linkdb
>> Indexer: adding segment: tinycrawl/segments/20100120130323
>> Indexer: adding segment: tinycrawl/segments/20100120130316
>> Optimizing index.
>> Indexer: done
>> Dedup: starting
>> Dedup: adding indexes in: tinycrawl/indexes
>> Exception in thread "main" java.io.IOException: Job failed!
>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>>        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>>
>> The log file gives:
>> java.lang.ArrayIndexOutOfBoundsException: -1
>>        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
>>        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
>>        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
