Figured this one out, just in case some other newbie has the same problem.

Windows places hidden files in the urls dir if one customizes the folder view. These files must be removed or Nutch thinks they are url files and processes them. Once the hidden files are removed, all is well.
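Nutch treats every file in the urls dir as a seed list, which is why these strays get processed as urls. Explorer hides them, but Cygwin's ls shows them; the usual culprit is desktop.ini (what Windows creates when you customize a folder's view), and Thumbs.db can turn up too:

$ ls -l urls
$ rm urls/desktop.ini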

jim s



----- Original Message ----- From: "jim shirreffs" <[EMAIL PROTECTED]>
To: "nutch lucene apache" <[email protected]>
Sent: Thursday, April 05, 2007 11:51 AM
Subject: Run Job Crashing


Nutch-0.8.1
Windows 2000/Windows XP
Java 1.6
cygwin1.dll Nov/2004 and cygwin1.dll latest release


Very strange. I ran the crawler once

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and everything worked until this error


Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070404094549
Indexer: adding segment: crawl/segments/20070404095026
Indexer: adding segment: crawl/segments/20070404095504
Optimizing index.
Exception in thread "main" java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
       at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
       at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)


Tried running the crawler again

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and now I consistently get this error

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
run java in NUTCH_JAVA_HOME D:\java\jdk1.6
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
       at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
       at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
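The generic "Job failed!" doesn't say what actually went wrong; with the stock Nutch 0.8.x log4j setup, the underlying exception should be in logs/hadoop.log:

$ tail -n 50 logs/hadoop.log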

I have one file, localhost, in my urls dir and it looks like this

http://localhost

My crawl-urlfilter.txt looks like this

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|swf|sw):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/

# skip everything else
-.
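If I read the filter right, http://localhost and pages under it pass (the + localhost rule matches), something like http://localhost/logo.gif gets dropped by the image-suffix rule first, and any url no rule matches is ignored.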

My nutch-site.xml looks like this

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>http.agent.name</name>
 <value>RadioCity</value>
 <description></description>
</property>

<property>
 <name>http.agent.description</name>
 <value>nutch web crawler</value>
 <description></description>
</property>

<property>
 <name>http.agent.url</name>
 <value>www.RadioCity.dynip.com/RadioCity/HtmlPages/Nutch</value>
 <description></description>
</property>

<property>
 <name>http.agent.email</name>
 <value>jpsb at flash.net</value>
 <description></description>
</property>
</configuration>


I am getting the same behavior on two separate hosts. If anyone can suggest what I might be doing wrong I would greatly appreciate it.

jim s

PS: I tried to mail from a different host but did not see the message in the mailing list. I hope only this message gets into the mailing list.
