Denis Haskin wrote:
I've been experimenting with Nutch 0.7.1 (this is on Windows 2000).
I set things up to crawl a local drive (well, actually a network
mapped drive) and it seemed to work fine. I let it run for a bit but
then aborted it because I wanted to adjust something.
I deleted all the crawl-* directories, but now whenever I try to run it
I get this error:
051004 120331 Updating D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
Exception in thread "main" java.io.IOException: Impossible condition: directories D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old and D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb cannot exist simultaneously
The complete crawl output is below. I am baffled by why this is
happening.
My urls file just has:
file:///d:/
Thanks for any assistance you can provide...
dwh
--- output from crawl ---
D:\workspaces\work\nutch-0.7.1>java -classpath conf;nutch-0.7.1.jar;build\classes;lib;lib\commons-logging-api-1.0.4.jar;lib\concurrent-1.3.4.jar;lib\jakarta-oro-2.0.7.jar;lib\jetty-5.1.2.jar;lib\junit-3.8.1.jar;lib\lucene-1.9-rc1-dev.jar;lib\lucene-misc-1.9-rc1-dev.jar;lib\servlet-api.jar;lib\taglibs-i18n.jar;lib\taglibs-i18n.tld;lib\xerces-2_6_2-apis.jar;lib\xerces-2_6_2.jar;. org.apache.nutch.tools.CrawlTool crawl urls
051004 120328 parsing
file:/D:/workspaces/work/nutch-0.7.1/conf/nutch-default.xml
051004 120328 parsing
file:/D:/workspaces/work/nutch-0.7.1/conf/crawl-tool.xml
051004 120328 parsing
file:/D:/workspaces/work/nutch-0.7.1/conf/nutch-site.xml
051004 120328 No FS indicated, using default:local
051004 120328 crawl started in: crawl-20051004120328
051004 120328 rootUrlFile = urls
051004 120328 threads = 10
051004 120328 depth = 5
051004 120328 Created webdb at
LocalFS,D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
051004 120328 Starting URL processing
051004 120328 Plugins: looking in: D:\workspaces\work\nutch-0.7.1\plugins
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\clustering-carrot2
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\creativecommons
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\index-basic\plugin.xml
051004 120328 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\index-more
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\language-identifier
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\nutch-extensionpoints\plugin.xml
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\ontology
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\parse-ext
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\parse-html\plugin.xml
051004 120328 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.html.HtmlParser
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\parse-js
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\parse-msword
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\parse-pdf
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\parse-rss
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\parse-text\plugin.xml
051004 120328 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.text.TextParser
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\protocol-file\plugin.xml
051004 120328 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.file.File
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\protocol-ftp
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\protocol-http\plugin.xml
051004 120328 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.http.Http
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\protocol-httpclient
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\query-basic\plugin.xml
051004 120328 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.basic.BasicQueryFilter
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\query-more
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\query-site\plugin.xml
051004 120328 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\query-url\plugin.xml
051004 120328 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
051004 120328 not including:
D:\workspaces\work\nutch-0.7.1\plugins\urlfilter-prefix
051004 120328 parsing:
D:\workspaces\work\nutch-0.7.1\plugins\urlfilter-regex\plugin.xml
051004 120328 impl: point=org.apache.nutch.net.URLFilter
class=org.apache.nutch.net.RegexURLFilter
051004 120328 found resource crawl-urlfilter.txt at
file:/D:/workspaces/work/nutch-0.7.1/conf/crawl-urlfilter.txt
051004 120328 Using URL normalizer:
org.apache.nutch.net.BasicUrlNormalizer
051004 120328 Added 1 pages
051004 120328 Processing pagesByURL: Sorted 1 instructions in 0.0
seconds.
051004 120328 Processing pagesByURL: Sorted Infinity instructions/second
051004 120328 Processing pagesByURL: Merged to new DB containing 1
records in 0.0 seconds
051004 120328 Processing pagesByURL: Merged Infinity records/second
051004 120328 Processing pagesByMD5: Sorted 1 instructions in 0.031
seconds.
051004 120328 Processing pagesByMD5: Sorted 32.25806451612903
instructions/second
051004 120328 Processing pagesByMD5: Merged to new DB containing 1
records in 0.0 seconds
051004 120328 Processing pagesByMD5: Merged Infinity records/second
051004 120328 Processing linksByMD5: Copied file (0 bytes) in 0.0 secs.
051004 120328 Processing linksByURL: Copied file (0 bytes) in 0.016 secs.
051004 120328 FetchListTool started
051004 120329 Processing pagesByURL: Sorted 1 instructions in 0.047
seconds.
051004 120329 Processing pagesByURL: Sorted 21.27659574468085
instructions/second
051004 120329 Processing pagesByURL: Merged to new DB containing 1
records in 0.0 seconds
051004 120329 Processing pagesByURL: Merged Infinity records/second
051004 120329 Processing pagesByMD5: Sorted 1 instructions in 0.016
seconds.
051004 120329 Processing pagesByMD5: Sorted 62.5 instructions/second
051004 120329 Processing pagesByMD5: Merged to new DB containing 1
records in 0.0 seconds
051004 120329 Processing pagesByMD5: Merged Infinity records/second
051004 120329 Processing linksByMD5: Copied file (0 bytes) in 0.0 secs.
051004 120329 Processing linksByURL: Copied file (0 bytes) in 0.0 secs.
051004 120329 Processing D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\segments\20051004120328\fetchlist.unsorted: Sorted 1 entries in 0.015 seconds.
051004 120329 Processing D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\segments\20051004120328\fetchlist.unsorted: Sorted 66.66666666666667 entries/second
051004 120329 Overall processing: Sorted 1 entries in 0.015 seconds.
051004 120329 Overall processing: Sorted 0.015 entries/second
051004 120329 FetchListTool completed
051004 120329 logging at INFO
051004 120329 fetching file:///d:/
051004 120330 status: segment 20051004120328, 1 pages, 0 errors, 11062
bytes, 1000 ms
051004 120330 status: 1.0 pages/s, 86.421875 kb/s, 11062.0 bytes/page
051004 120331 Updating
D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db
Exception in thread "main" java.io.IOException: Impossible condition: directories D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old and D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb cannot exist simultaneously
at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1484)
at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1457)
at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:360)
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
.
Just delete D:\workspaces\work\nutch-0.7.1\crawl-20051004120328\db\webdb.old
Gal
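
For anyone who hits this repeatedly, the suggestion above could be scripted. What follows is only a rough sketch in Python, not anything that ships with Nutch. It assumes that when both db\webdb and db\webdb.old exist, the webdb.old copy is the leftover one and is safe to discard; the thread suggests this but does not confirm it. The function name remove_stale_webdb_old and the dry_run parameter are made up for illustration.

```python
import shutil
from pathlib import Path


def remove_stale_webdb_old(db_dir, dry_run=False):
    """Remove <db_dir>/webdb.old, but only if <db_dir>/webdb also exists.

    Hypothetical helper (not part of Nutch). Assumes webdb.old is the
    leftover copy whenever both directories are present.
    Returns True if webdb.old was removed (or would be, in dry-run mode).
    """
    db_dir = Path(db_dir)
    webdb = db_dir / "webdb"
    webdb_old = db_dir / "webdb.old"

    # Act only in the state the exception complains about: both present.
    if not (webdb.is_dir() and webdb_old.is_dir()):
        return False

    if not dry_run:
        # Delete the webdb.old directory tree; webdb itself is left alone.
        shutil.rmtree(webdb_old)
    return True
```

To try it, call it with the db directory named in the error message and dry_run=True first. It returns True when it finds both directories and nothing is deleted until dry_run is dropped.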