Hi, sorry for the fumbled reply. I've tried deleting the directory and starting the crawl from scratch a number of times, with very similar results each time.
The system seems to be generating the exception after the fetch block of the
output, at an apparently arbitrary depth. It leaves the directory with a db
folder containing:

  Mar  2 09:30  dbreadlock
  Mar  2 09:31  dbwritelock
  Mar  2 09:30  webdb
  Mar  2 09:31  webdb.new

The webdb.new folder contains:

  Mar  2 09:30  pagesByURL
  Mar  2 09:30  stats
  Mar  2 09:31  tmp

I have the following set in my nutch-site.xml file:

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.RegexUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer
  class.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this
  value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to include.
  Any plugin not matching this expression is excluded. In any case you need
  at least include the nutch-extensionpoints plugin. By default Nutch
  includes crawling just HTML and plain text via HTTP, and basic indexing
  and search plugins.</description>
</property>

I don't think any of this should cause the problem. I'm going to try
reinstalling and setting everything up again, but if anyone has any idea
what the problem might be then please let me know.
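In case it helps anyone reproduce this: between attempts I've just been
deleting the leftovers by hand before re-issuing the crawl (in fact I've been
deleting the whole crawl directory and starting from scratch). Below is a
minimal sketch of that cleanup. The class name and hard-coded path are mine,
adjust to your install, and I'm assuming the lock files from the dead run are
safe to remove since nothing else is using the db. As noted above, this only
gets things running again temporarily; the exception still comes back.

import java.io.File;

// Hypothetical helper: recursively deletes the stale webdb.new folder
// (and the db lock files) that a failed update leaves behind.
public class CleanStaleWebDb {

  public static void main(String[] args) {
    File db = new File("C:\\PF\\nutch-0.7.1\\LIVE\\db");
    deleteRecursively(new File(db, "webdb.new"));
    new File(db, "dbreadlock").delete();
    new File(db, "dbwritelock").delete();
  }

  // File.delete() only removes files and *empty* directories,
  // so walk the children first.
  private static void deleteRecursively(File f) {
    File[] children = f.listFiles();  // null for plain files
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    if (f.exists() && !f.delete()) {
      System.err.println("could not delete " + f);
    }
  }
}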
cheers,
Julian.

--- sudhendra seshachala <[EMAIL PROTECTED]> wrote:

> Delete the folder/database and then re-issue the crawl command. The
> database/folder gets created when crawl is used. I am a recent user too,
> but I did get the same message and I corrected it by deleting the folder.
> If anyone has better ideas, please share.
>
> Thanks
>
> [EMAIL PROTECTED] wrote:
> Hi,
>
> I've been experimenting with nutch and lucene. Everything was working
> fine, but now I'm getting an exception thrown from the crawl command.
>
> The command manages a few fetch cycles, but then I get the following
> message:
>
> 060301 161128 status: segment 20060301161046, 38 pages, 0 errors,
> 856591 bytes, 41199 ms
> 060301 161128 status: 0.92235243 pages/s, 162.43396 kb/s,
> 22541.87 bytes/page
> 060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
> 060301 161129 Updating for C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
> 060301 161129 Processing document 0
> 060301 161130 Finishing update
> 060301 161130 Processing pagesByURL: Sorted 952 instructions in 0.02 seconds.
> 060301 161130 Processing pagesByURL: Sorted 47600.0 instructions/second
> java.io.IOException: already exists:
> C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
>   at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>   at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>   at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>   at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
>   at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
>   at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
> Exception in thread "main"
>
> Does anyone have any ideas what the problem is likely to be?
> I am running nutch 0.7.1
>
> thanks,
>
> Julian.

Sudhi Seshachala
http://sudhilogs.blogspot.com/
