[Nutch-dev] [jira] Commented: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL

Mike Alulin (JIRA) Tue, 24 Jan 2006 15:41:08 -0800

    [ 
http://issues.apache.org/jira/browse/NUTCH-117?page=comments#action_12363898 ]


Mike Alulin commented on NUTCH-117:
-----------------------------------

I have the same issue on my new production system, although the same code works 
on dev and on the old production system without any problems. 

The fix for this bug is to uncomment "pageDb.close();" in the 
WebDBWriter.java file. Otherwise the reader keeps the 
webdb.new\pagesByURL\data file locked, and sometimes it cannot be deleted.
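
To illustrate the general idea only (this is not the Nutch code): on Windows, 
a file that is still held open by a java.io stream usually cannot be deleted, 
so a leftover directory such as webdb.new may survive and trigger "already 
exists" on the next update. The sketch below closes the reader before deleting. 
The class name CloseBeforeDelete and both helper methods are hypothetical and 
exist only for this example.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class CloseBeforeDelete {

    // Reads the first byte of the file. try-with-resources closes the
    // stream (and releases the OS file handle) before the method returns.
    public static int readFirstByte(File data) throws IOException {
        try (FileInputStream in = new FileInputStream(data)) {
            return in.read();
        }
    }

    // Hypothetical stand-in for an update step: read the old data file,
    // then remove the file and its directory. Returns true only if both
    // deletions succeeded. If the reader were left open, the first
    // delete() could fail on Windows and the directory would be left behind.
    public static boolean readThenRemove(File dir) throws IOException {
        File data = new File(dir, "data");
        readFirstByte(data);
        return data.delete() && dir.delete();
    }

    public static void main(String[] args) throws IOException {
        // Assumed layout for the demo: a temp directory holding one "data" file.
        File dir = Files.createTempDirectory("webdb.new").toFile();
        try (FileOutputStream out = new FileOutputStream(new File(dir, "data"))) {
            out.write(new byte[] {1, 2, 3});
        }
        System.out.println("removed: " + readThenRemove(dir));
    }
}
```

Checking the boolean result of delete() matters here, because java.io.File 
reports a failed deletion through its return value and not through an exception.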

> Crawl crashes with java.io.IOException: already exists: 
> C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
> -------------------------------------------------------------------------------------------------------------
>
>          Key: NUTCH-117
>          URL: http://issues.apache.org/jira/browse/NUTCH-117
>      Project: Nutch
>         Type: Bug
>     Versions: 0.7, 0.6, 0.7.1
>  Environment: Windows 2000  P4 1.70GHz 512MB RAM
> Java 1.5.0_05
>     Reporter: Stephen Cross
>     Priority: Critical

>
> I started a crawl from the command line using nutch 0.7.1:
> nutch-daemon.sh start crawl urls.txt -dir oct18 -threads 4 -depth 20
> After crawling for over 15 hours, the crawl crashed with the following 
> exception:
> 051019 050543 status: segment 20051019050438, 30 pages, 0 errors, 1589818 
> bytes, 48020 ms
> 051019 050543 status: 0.6247397 pages/s, 258.65167 kb/s, 52993.934 bytes/page
> 051019 050544 Updating C:\nutch\crawl.intranet\oct18\db
> 051019 050544 Updating for 
> C:\nutch\crawl.intranet\oct18\segments\20051019050438
> 051019 050544 Processing document 0
> 051019 050544 Finishing update
> 051019 050544 Processing pagesByURL: Sorted 47 instructions in 0.02 seconds.
> 051019 050544 Processing pagesByURL: Sorted 2350.0 instructions/second
> Exception in thread "main" java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
>         at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>         at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>         at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>         at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
>         at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
>         at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
> This was on the 14th segment out of the requested depth of 20. A quick 
> Google search on the exception brings up a few previous posts with the same 
> error but no definitive answer. It seems to have been occurring since nutch 0.6.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
