Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
-------------------------------------------------------------------------------------------------------------

         Key: NUTCH-117
         URL: http://issues.apache.org/jira/browse/NUTCH-117
     Project: Nutch
        Type: Bug
    Versions: 0.7.1, 0.7, 0.6    
 Environment: Windows 2000  P4 1.70GHz 512MB RAM
Java 1.5.0_05

    Reporter: Stephen Cross
    Priority: Critical


I started a crawl from the command line using Nutch 0.7.1:

nutch-daemon.sh start crawl urls.txt -dir oct18 -threads 4 -depth 20

After crawling for over 15 hours, the crawl crashed with the following exception:

051019 050543 status: segment 20051019050438, 30 pages, 0 errors, 1589818 bytes, 48020 ms
051019 050543 status: 0.6247397 pages/s, 258.65167 kb/s, 52993.934 bytes/page
051019 050544 Updating C:\nutch\crawl.intranet\oct18\db
051019 050544 Updating for C:\nutch\crawl.intranet\oct18\segments\20051019050438
051019 050544 Processing document 0
051019 050544 Finishing update
051019 050544 Processing pagesByURL: Sorted 47 instructions in 0.02 seconds.
051019 050544 Processing pagesByURL: Sorted 2350.0 instructions/second
Exception in thread "main" java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
        at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
        at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
        at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
        at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)


This was on the 14th segment of the requested depth of 20. A quick Google 
search on the exception brings up a few previous posts with the same error but 
no definitive answer; it seems to have been occurring since Nutch 0.6.





