Okay. Have you tried the 0.8 version? It seems to be more stable than the 0.7.x you are using. It is a bit different too, with Hadoop and Nutch being separate. I had a few issues using 0.7.x, but with the nightly build (0.8) I was up to speed comparatively sooner. I hope this helps. I am not trying to sidestep the problem, just noting that the next release is more stable and, moreover, there is no backward compatibility in 0.8.x (that is what I read in one of the mails in the archive). You are better off using 0.8.

Thanks
Sudhi
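P.S. If you want to try 0.8, a checkout and build along these lines should get you going. This is a minimal sketch; the SVN path is my assumption from memory, so verify it against the Nutch site before using it:

  # assumed location of the 0.8 development trunk
  svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch-trunk
  cd nutch-trunk
  # default Ant build
  ant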
[EMAIL PROTECTED] wrote:

Hi,

Sorry for the fumbled reply. I've tried deleting the directory and starting the crawl from scratch a number of times, with very similar results. The system seems to generate the exception after the fetch block of the output, at an apparently arbitrary depth. It leaves the directory with a db folder containing:

  Mar 2 09:30 dbreadlock
  Mar 2 09:31 dbwritelock
  Mar 2 09:30 webdb
  Mar 2 09:31 webdb.new

The webdb.new folder contains:

  Mar 2 09:30 pagesByURL
  Mar 2 09:30 stats
  Mar 2 09:31 tmp

I have the following set in my nutch-site.xml file:

  <property>
    <name>urlnormalizer.class</name>
    <value>org.apache.nutch.net.RegexUrlNormalizer</value>
    <description>Name of the class used to normalize URLs.</description>
  </property>

  <property>
    <name>urlnormalizer.regex.file</name>
    <value>regex-normalize.xml</value>
    <description>Name of the config file used by the RegexUrlNormalizer
    class.</description>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes.
    If this value is nonnegative (>=0), content longer than it will be
    truncated; otherwise, no truncation at all.</description>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded. In any
    case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins.</description>
  </property>

I don't think any of this should cause the problem. I'm going to try reinstalling and setting everything up again, but if anyone has any idea what the problem might be then please let me know.

cheers,
Julian.

--- sudhendra seshachala wrote:
> Delete the folder/database and then re-issue the crawl command.
> The database/folder gets created when crawl is used.
> I am a recent user too, but I did get the same message and I corrected
> it by deleting the folder. If anyone has better ideas, please share.
>
> Thanks
>
> [EMAIL PROTECTED] wrote:
> Hi,
>
> I've been experimenting with Nutch and Lucene; everything was working
> fine, but now I'm getting an exception thrown from the crawl command.
>
> The command manages a few fetch cycles, but then I get the following
> message:
>
> 060301 161128 status: segment 20060301161046, 38 pages, 0 errors, 856591 bytes, 41199 ms
> 060301 161128 status: 0.92235243 pages/s, 162.43396 kb/s, 22541.87 bytes/page
> 060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
> 060301 161129 Updating for C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
> 060301 161129 Processing document 0
> 060301 161130 Finishing update
> 060301 161130 Processing pagesByURL: Sorted 952 instructions in 0.02 seconds.
> 060301 161130 Processing pagesByURL: Sorted 47600.0 instructions/second
> java.io.IOException: already exists: C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
>         at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>         at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>         at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>         at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
>         at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
>         at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
> Exception in thread "main"
>
> Does anyone have any idea what the problem is likely to be? I am
> running Nutch 0.7.1.
>
> thanks,
>
> Julian.
>
> Sudhi Seshachala
> http://sudhilogs.blogspot.com/
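To make the "delete the folder" step from my earlier mail concrete, the cleanup and re-crawl look roughly like this. A minimal sketch, assuming a Cygwin shell on Windows as in Julian's logs; the seed-URL file name "urls" and the depth are placeholders for your own setup:

  # run from the nutch-0.7.1 directory
  # remove the half-written crawl directory, including the stale
  # webdb.new folder and the dbreadlock/dbwritelock files
  rm -rf LIVE
  # re-issue the crawl; the crawl tool expects a fresh -dir
  bin/nutch crawl urls -dir LIVE -depth 3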
Sudhi Seshachala
http://sudhilogs.blogspot.com/
