Hello Sami,

I'm running the nightly build #740 (obtained from:
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/740/).

My configuration is more or less default. These are the only properties
that I changed in nutch-default.xml:

<name>http.agent.name</name>          (added the proper name)
<name>http.agent.description</name>   (removed the default description)
<name>db.max.outlinks.per.page</name> (changed the value to 35)
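In case the exact shape of those overrides matters, they look roughly like
this in the file (the agent name below is just a placeholder, not the one I
actually use, and if I remember right the stock default for
db.max.outlinks.per.page is 100):

<property>
  <name>http.agent.name</name>
  <value>myCrawler</value>   <!-- placeholder; the real file has our proper agent name -->
</property>
<property>
  <name>http.agent.description</name>
  <value></value>            <!-- default description removed -->
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>35</value>          <!-- lowered from the default of 100 -->
</property>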
These are the changes that I made to the crawl-urlfilter.txt file. I added
this line (so that URLs such as
http://www.mydomain.com/somedirectory/index.html pass the filter):

+^http://([a-z0-9]*\.)*mydomain.com/somedirectory/

and I completely removed this line (quoted here with its comment):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

And finally, to run the crawl, I put the following in cygwin:

$ bin/nutch crawl urls -dir kb -depth 3 >& crawl.log
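One more thing I noticed while staring at the stack trace quoted below: the
message comes straight out of Lucene's Field constructor, which (in the
Lucene 2.x jars that the trunk bundles, as far as I can tell) refuses any
field that is neither stored nor indexed. Here is a minimal sketch of what
LuceneWriter.createLuceneDoc seems to be running into (the field name and
value are made up for illustration):

import org.apache.lucene.document.Field;

public class FieldSanity {
    public static void main(String[] args) {
        // Fine: the field is stored, even though it is not indexed.
        new Field("myfield", "some value", Field.Store.YES, Field.Index.NO);

        // Throws java.lang.IllegalArgumentException: "it doesn't make sense
        // to have a field that is neither indexed nor stored".
        new Field("myfield", "some value", Field.Store.NO, Field.Index.NO);
    }
}

So presumably one of the indexing filters registers a field whose store and
index flags both end up off, although I don't know which one.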
Thanks for your reply. I really appreciate all the time and effort you and
the other developers and contributors put into this project.

Cheers


Sami Siren-2 wrote:
>
> Hi,
>
> and thanks for being persistent. Can you specify what is the version of
> nutch that you are running, is it a nightly build (if yes, which one?)
> or did you check out the svn trunk? And just to be sure: you are running
> with the default configuration?
>
> --
> Sami Siren
>
> ahammad wrote:
>> I checked hadoop.log and this is what it has:
>>
>> java.lang.IllegalArgumentException: it doesn't make sense to have a field
>> that is neither indexed nor stored
>>         at org.apache.lucene.document.Field.<init>(Field.java:279)
>>         at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133)
>>         at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
>>         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
>>         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40)
>>         at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
>>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
>>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
>>
>> I don't understand what that refers to specifically. I'm running it at
>> its default configuration, without any of the advanced indexing that I
>> have in my 0.9 install.
>>
>> Cheers.
>>
>> Andrzej Bialecki wrote:
>>> ahammad wrote:
>>>> I am aware that this is still a development version, but I need to
>>>> test a few things with Nutch/Solr, so I installed the latest dev
>>>> version of Nutch 1.0.
>>>>
>>>> I tried running a crawl like I did with the working 0.9 version. From
>>>> the log, it seems to fetch all the pages properly, but it fails at
>>>> the indexing:
>>>>
>>>> CrawlDb update: starting
>>>> CrawlDb update: db: kb/crawldb
>>>> CrawlDb update: segments: [kb/segments/20090302135858]
>>>> CrawlDb update: additions allowed: true
>>>> CrawlDb update: URL normalizing: true
>>>> CrawlDb update: URL filtering: true
>>>> CrawlDb update: Merging segment data into db.
>>>> CrawlDb update: done
>>>> LinkDb: starting
>>>> LinkDb: linkdb: kb/linkdb
>>>> LinkDb: URL normalize: true
>>>> LinkDb: URL filter: true
>>>> LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135757
>>>> LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135807
>>>> LinkDb: adding segment: file:/c:/nutch-2009-03-02_04-01-53/kb/segments/20090302135858
>>>> LinkDb: done
>>>> Indexer: starting
>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>>>>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
>>>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:146)
>>>>
>>>> I took a look at all the configuration and as far as I can tell, I
>>>> did the same thing with my 0.9 install. Could it be that I didn't
>>>> install it properly? I unzipped it and ran ant and ant war in the
>>>> root directory.
>>>>
>>> Please check the logs in the logs/ directory - the above message is not
>>> informative; the real reason for the failure can be found in the logs.
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
