Figured this one out, just in case some other newbie has the same problem.
Windows places hidden files in the urls dir if one customizes the folder
view. These files must be removed or Nutch thinks they are URL files and
processes them. Once the hidden files are removed, all is well.
jim s
----- Original Message -----
From: "jim shirreffs" <[EMAIL PROTECTED]>
To: "nutch lucene apache" <[email protected]>
Sent: Thursday, April 05, 2007 11:51 AM
Subject: Run Job Crashing
Nutch-0.8.1
Windows 2000/Windows XP
Java 1.6
cygwin1.dll Nov 2004 release and cygwin1.dll latest release
Very strange: I ran the crawler once
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
and everything worked until this error:
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070404094549
Indexer: adding segment: crawl/segments/20070404095026
Indexer: adding segment: crawl/segments/20070404095504
Optimizing index.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
I tried running the crawler again
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
and now I consistently get this error:
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
run java in NUTCH_JAVA_HOME D:\java\jdk1.6
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
I have one file, localhost, in my urls dir and it looks like this:
http://localhost
My crawl-urlfilter.txt looks like this:
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|swf|sw):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/
# skip everything else
-.
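(As a quick sanity check, independent of Nutch itself, the accept pattern
can be tested against the seed URL with grep from the Cygwin shell; note
the trailing slash, since the pattern requires one:)

$ echo 'http://localhost/' | grep -E '^http://([a-z0-9]*\.)*localhost/'
http://localhost/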
My nutch-site.xml looks like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>RadioCity</value>
<description></description>
</property>
<property>
<name>http.agent.description</name>
<value>nutch web crawler</value>
<description></description>
</property>
<property>
<name>http.agent.url</name>
<value>www.RadioCity.dynip.com/RadioCity/HtmlPages/Nutch</value>
<description></description>
</property>
<property>
<name>http.agent.email</name>
<value>jpsb at flash.net</value>
<description></description>
</property>
</configuration>
I am getting the same behavior on two separate hosts. If anyone can
suggest what I might be doing wrong, I would greatly appreciate it.
jim s
PS: I tried to mail this from a different host but did not see the message
in the mailing list. I hope only this message gets into the list.