When running a crawl with Nutch 0.7.2, the following exception appears. Can't I
reuse the already-crawled db and segments directories?
I don't want to redo the same crawl job from the beginning!
Please help, many thanks!
061028 112509 Processing pagesByMD5: Merged to new DB containing 6524 records in 0.14 seconds
061028 112509 Processing pagesByMD5: Merged 46599.99999999999 records/second
061028 112510 Processing linksByMD5: Sorted 102630 instructions in 1.875 seconds.
061028 112510 Processing linksByMD5: Sorted 54736.0 instructions/second
061028 112515 Processing linksByMD5: Merged to new DB containing 247265 records in 4.234 seconds
061028 112515 Processing linksByMD5: Merged 58399.858290033066 records/second
061028 112517 Processing linksByURL: Sorted 90793 instructions in 2.235 seconds.
061028 112517 Processing linksByURL: Sorted 40623.26621923938 instructions/second
061028 112522 Processing linksByURL: Merged to new DB containing 247265 records in 4.172 seconds
061028 112522 Processing linksByURL: Merged 59267.73729626079 records/second
061028 112524 Processing linksByMD5: Sorted 93700 instructions in 1.625 seconds.
061028 112524 Processing linksByMD5: Sorted 57661.53846153846 instructions/second
061028 112528 Processing linksByMD5: Merged to new DB containing 247265 records in 3.609 seconds
061028 112528 Processing linksByMD5: Merged 68513.43862565808 records/second
061028 112534 Update finished
061028 112534 FetchListTool started
061028 112535 Processing pagesByURL: Sorted 1606 instructions in 0.032 seconds.
061028 112535 Processing pagesByURL: Sorted 50187.5 instructions/second
061028 112535 Processing E:\cygwinxp\nutch-0.7.2\bin\webjxcw\segments\20061028112534\fetchlist.unsorted: Sorted 1606 entries in 0.046 seconds.
061028 112535 Processing E:\cygwinxp\nutch-0.7.2\bin\webjxcw\segments\20061028112534\fetchlist.unsorted: Sorted 34913.04347826087 entries/second
061028 112535 Overall processing: Sorted 1606 entries in 0.046 seconds.
061028 112535 Overall processing: Sorted 2.8642590286425904E-5 entries/second
Exception in thread "main" java.io.IOException: already exists: E:\cygwinxp\nutch-0.7.2\bin\webjxcw\db\webdb.new\pagesByURL
    at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
    at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
    at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
    at org.apache.nutch.tools.FetchListTool.emitFetchList(FetchListTool.java:499)
    at org.apache.nutch.tools.FetchListTool.emitFetchList(FetchListTool.java:319)
    at org.apache.nutch.tools.FetchListTool.main(FetchListTool.java:593)
    at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:138)
--
kevin
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general