Hello Sami,

I'm running the nightly build #740 (obtained from:
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/740/).

My configuration is more or less default. These are the only properties
that I changed in nutch-default.xml:

<name>http.agent.name</name>          (added the proper name)
<name>http.agent.description</name>   (removed the default description)
<name>db.max.outlinks.per.page</name> (changed the value to 35)
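In case the exact shape of those overrides matters, they look roughly like
this in the file (the agent name below is just a placeholder, not the one I
actually use, and if I remember right the stock default for
db.max.outlinks.per.page is 100):

<property>
  <name>http.agent.name</name>
  <value>myCrawler</value>   <!-- placeholder; the real file has our proper agent name -->
</property>
<property>
  <name>http.agent.description</name>
  <value></value>            <!-- default description removed -->
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>35</value>          <!-- lowered from the default of 100 -->
</property>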
These are the changes that I made to the crawl-urlfilter.txt file. I added
this line (so that URLs such as
http://www.mydomain.com/somedirectory/index.html pass the filter):

+^http://([a-z0-9]*\.)*mydomain.com/somedirectory/

and I completely removed this line (quoted here with its comment):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

And finally, to run the crawl, I put the following in cygwin:

$ bin/nutch crawl urls -dir kb -depth 3 >& crawl.log
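One more thing I noticed while staring at the stack trace quoted below: the
message comes straight out of Lucene's Field constructor, which (in the
Lucene 2.x jars that the trunk bundles, as far as I can tell) refuses any
field that is neither stored nor indexed. Here is a minimal sketch of what
LuceneWriter.createLuceneDoc seems to be running into (the field name and
value are made up for illustration):

import org.apache.lucene.document.Field;

public class FieldSanity {
    public static void main(String[] args) {
        // Fine: the field is stored, even though it is not indexed.
        new Field("myfield", "some value", Field.Store.YES, Field.Index.NO);

        // Throws java.lang.IllegalArgumentException: "it doesn't make sense
        // to have a field that is neither indexed nor stored".
        new Field("myfield", "some value", Field.Store.NO, Field.Index.NO);
    }
}

So presumably one of the indexing filters registers a field whose store and
index flags both end up off, although I don't know which one.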
Thanks for your reply. I really appreciate all the time and effort you and
the other developers and contributors put into this project.

Cheers


Sami Siren-2 wrote:
>
> Hi,
>
> and thanks for being persistent. Can you specify what is the version of
> nutch that you are running, is it a nightly build (if yes, which one?)
> or did you check out the svn trunk? And just to be sure: you are running
> with the default configuration?
>
> --
> Sami Siren
>
> ahammad wrote:
>> I checked hadoop.log and this is what it has:
>>
>> java.lang.IllegalArgumentException: it doesn't make sense to have a field
>> that is neither indexed nor stored
>>         at org.apache.lucene.document.Field.<init>(Field.java:279)
>>         at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133)
>>         at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
>>         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
>>         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40)
>>         at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
>>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
>>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
>>
>> I don't understand what that refers to specifically. I'm running it at
>> its default configuration, without any of the advanced indexing that I
>> have in my 0.9 install.
>>
>> Cheers.
>>
>> Andrzej Bialecki wrote:
>>> ahammad wrote:
>>>> I am aware that this is still a development version, but I need to
>>>> test a few things with Nutch/Solr, so I installed the latest dev
>>>> version of Nutch 1.0.
>>>>
>>>> I tried running a crawl like I did with the working 0.9 version. From
>>>> the log, it seems to fetch all the pages properly, but it fails at
>>>> the indexing:
>>>>
>>>> CrawlDb update: starting
>>>> CrawlDb update: db: kb/crawldb
>>>> CrawlDb update: segments: [kb/segments/20090302135858]
>>>> CrawlDb update: additions allowed: true
>>>> CrawlDb update: URL normalizing: true
>>>> CrawlDb update: URL filtering: true
>>>> CrawlDb update: Merging segment data into db.
>>>> CrawlDb update: done
>>>> LinkDb: starting
>>>> LinkDb: linkdb: kb/linkdb
>>>> LinkDb: URL normalize: true
>>>> LinkDb: URL filter: true
>>>> LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135757
>>>> LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135807
>>>> LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135858
>>>> LinkDb: done
>>>> Indexer: starting
>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>>>>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
>>>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:146)
>>>>
>>>> I took a look at all the configuration and as far as I can tell, I
>>>> did the same thing with my 0.9 install. Could it be that I didn't
>>>> install it properly? I unzipped it and ran ant and ant war in the
>>>> root directory.
>>>>
>>> Please check the logs in the logs/ directory - the above message is not
>>> informative; the real reason for the failure can be found in the logs.
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
