Denis Haskin wrote:
I've been trying to do some experimentation with nutch 0.7.1 (this is on Windows 2000).

I set things up to crawl a local drive (well, actually a network mapped drive) and it seemed to work fine. I let it run for a bit but then aborted it because I wanted to adjust something.

I deleted all the crawl-* directories, but now when I try to run it I am always getting this error:

051004 120331 Updating D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
Exception in thread "main" java.io.IOException: Impossible condition: directories D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old and D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb cannot exist simultaneously

The complete crawl output is below. I am baffled by why this is happening.

My urls file just has:
file:///d:/

Thanks for any assistance you can provide...

dwh

--- output from crawl ---

D:\workspaces\work\nutch-0.7.1>java -classpath conf;nutch-0.7.1.jar;build\classes;lib;lib\commons-logging-api-1.0.4.jar;lib\concurrent-1.3.4.jar;lib\jakarta-oro-2.0.7.jar;lib\jetty-5.1.2.jar;lib\junit-3.8.1.jar;lib\lucene-1.9-rc1-dev.jar;lib\lucene-misc-1.9-rc1-dev.jar;lib\servlet-api.jar;lib\taglibs-i18n.jar;lib\taglibs-i18n.tld;lib\xerces-2_6_2-apis.jar;lib\xerces-2_6_2.jar;. org.apache.nutch.tools.CrawlTool crawl urls
051004 120328 parsing file:/D:/workspaces/work/nutch-0.7.1/conf/nutch-default.xml
051004 120328 parsing file:/D:/workspaces/work/nutch-0.7.1/conf/crawl-tool.xml
051004 120328 parsing file:/D:/workspaces/work/nutch-0.7.1/conf/nutch-site.xml
051004 120328 No FS indicated, using default:local
051004 120328 crawl started in: crawl-20051004120328
051004 120328 rootUrlFile = urls
051004 120328 threads = 10
051004 120328 depth = 5
051004 120328 Created webdb at LocalFS,D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
051004 120328 Starting URL processing
051004 120328 Plugins: looking in: D:\workspaces\work\nutch-0.7.1\plugins
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\clustering-carrot2
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\creativecommons
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\index-basic\plugin.xml
051004 120328 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\index-more
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\language-identifier
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\nutch-extensionpoints\plugin.xml
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\ontology
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\parse-ext
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\parse-html\plugin.xml
051004 120328 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\parse-js
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\parse-msword
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\parse-pdf
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\parse-rss
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\parse-text\plugin.xml
051004 120328 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\protocol-file\plugin.xml
051004 120328 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.file.File
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\protocol-ftp
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\protocol-http\plugin.xml
051004 120328 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\protocol-httpclient
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\query-basic\plugin.xml
051004 120328 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\query-more
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\query-site\plugin.xml
051004 120328 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\query-url\plugin.xml
051004 120328 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
051004 120328 not including: D:\workspaces\work\nutch-0.7.1\plugins\urlfilter-prefix
051004 120328 parsing: D:\workspaces\work\nutch-0.7.1\plugins\urlfilter-regex\plugin.xml
051004 120328 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
051004 120328 found resource crawl-urlfilter.txt at file:/D:/workspaces/work/nutch-0.7.1/conf/crawl-urlfilter.txt
051004 120328 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
051004 120328 Added 1 pages
051004 120328 Processing pagesByURL: Sorted 1 instructions in 0.0 seconds.
051004 120328 Processing pagesByURL: Sorted Infinity instructions/second
051004 120328 Processing pagesByURL: Merged to new DB containing 1 records in 0.0 seconds
051004 120328 Processing pagesByURL: Merged Infinity records/second
051004 120328 Processing pagesByMD5: Sorted 1 instructions in 0.031 seconds.
051004 120328 Processing pagesByMD5: Sorted 32.25806451612903 instructions/second
051004 120328 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0 seconds
051004 120328 Processing pagesByMD5: Merged Infinity records/second
051004 120328 Processing linksByMD5: Copied file (0 bytes) in 0.0 secs.
051004 120328 Processing linksByURL: Copied file (0 bytes) in 0.016 secs.
051004 120328 FetchListTool started
051004 120329 Processing pagesByURL: Sorted 1 instructions in 0.047 seconds.
051004 120329 Processing pagesByURL: Sorted 21.27659574468085 instructions/second
051004 120329 Processing pagesByURL: Merged to new DB containing 1 records in 0.0 seconds
051004 120329 Processing pagesByURL: Merged Infinity records/second
051004 120329 Processing pagesByMD5: Sorted 1 instructions in 0.016 seconds.
051004 120329 Processing pagesByMD5: Sorted 62.5 instructions/second
051004 120329 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0 seconds
051004 120329 Processing pagesByMD5: Merged Infinity records/second
051004 120329 Processing linksByMD5: Copied file (0 bytes) in 0.0 secs.
051004 120329 Processing linksByURL: Copied file (0 bytes) in 0.0 secs.
051004 120329 Processing D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\segments\20051004120328\fetchlist.unsorted: Sorted 1 entries in 0.015 seconds.
051004 120329 Processing D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\segments\20051004120328\fetchlist.unsorted: Sorted 66.66666666666667 entries/second
051004 120329 Overall processing: Sorted 1 entries in 0.015 seconds.
051004 120329 Overall processing: Sorted 0.015 entries/second
051004 120329 FetchListTool completed
051004 120329 logging at INFO
051004 120329 fetching file:///d:/
051004 120330 status: segment 20051004120328, 1 pages, 0 errors, 11062 bytes, 1000 ms
051004 120330 status: 1.0 pages/s, 86.421875 kb/s, 11062.0 bytes/page
051004 120331 Updating D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
Exception in thread "main" java.io.IOException: Impossible condition: directories D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old and D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb cannot exist simultaneously
       at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1484)
       at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1457)
       at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:360)
       at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)



Just delete D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old

Gal
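
For anyone who hits this repeatedly, the manual cleanup suggested above can be scripted. The following is only a minimal sketch in Python, which is not part of the original thread; the function name remove_stale_webdb_old is hypothetical, and the webdb / webdb.old layout inside the db directory is taken from the error message rather than from Nutch documentation. It deletes webdb.old only when webdb is also present, which is the exact condition the exception complains about, and leaves everything else alone.

```python
import shutil
import sys
from pathlib import Path


def remove_stale_webdb_old(db_dir):
    """Delete <db_dir>/webdb.old, but only if <db_dir>/webdb also exists.

    Hypothetical helper: the directory names come from the error message
    in the thread, not from any Nutch API.
    Returns True if webdb.old was removed, False if nothing was done.
    """
    db_dir = Path(db_dir)
    current = db_dir / "webdb"
    stale = db_dir / "webdb.old"

    # Only act when both directories are present, i.e. the state that
    # WebDBWriter reports as an "impossible condition".
    if current.is_dir() and stale.is_dir():
        shutil.rmtree(stale)
        return True
    return False


if __name__ == "__main__":
    # Each argument is a crawl db directory, for example the
    # crawl-<timestamp>\db path shown in the error message.
    for arg in sys.argv[1:]:
        removed = remove_stale_webdb_old(arg)
        print(arg, "-> removed webdb.old" if removed else "-> nothing to remove")
```

The check on both directories is deliberate: if only webdb.old exists, it may be the sole surviving copy of the database, so the sketch does not touch it.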
