Re: Problem with crawling using the latest 1.0 trunk

ahammad Mon, 02 Mar 2009 11:24:35 -0800

I checked hadoop.log and this is what it has:

java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
        at org.apache.lucene.document.Field.<init>(Field.java:279)
        at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133)
        at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)



I don't understand what that refers to specifically. I'm running it with its
default configuration, without any of the advanced indexing that I have in
my 0.9 install.

Cheers.



Andrzej Bialecki wrote:
> 
> ahammad wrote:
>> I am aware that this is still a development version, but I need to test
>> a few things with Nutch/Solr so I installed the latest dev version of
>> Nutch 1.0.
>> 
>> I tried running a crawl like I did with the working 0.9 version. From the
>> log, it seems to fetch all the pages properly, but it fails at the
>> indexing:
>> 
>> CrawlDb update: starting
>> CrawlDb update: db: kb/crawldb
>> CrawlDb update: segments: [kb/segments/20090302135858]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> LinkDb: starting
>> LinkDb: linkdb: kb/linkdb
>> LinkDb: URL normalize: true
>> LinkDb: URL filter: true
>> LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135757
>> LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135807
>> LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135858
>> LinkDb: done
>> Indexer: starting
>> Exception in thread "main" java.io.IOException: Job failed!
>>      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>>      at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
>>      at org.apache.nutch.crawl.Crawl.main(Crawl.java:146)
>> 
>> 
>> I took a look at all the configuration and as far as I can tell, I did
>> the same thing with my 0.9 install. Could it be that I didn't install it
>> properly? I unzipped it and ran ant and ant war in the root directory.
> 
> Please check the logs in the logs/ directory. The above message is not
> informative; the real reason for the failure can be found in the logs.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Problem-with-crawling-using-the-latest-1.0-trunk-tp22294581p22294829.html
Sent from the Nutch - User mailing list archive at Nabble.com.
