Nutch-0.8.1
Windows 2000/Windows XP
Java 1.6
cygwin1.dll (Nov 2004 build) and the latest cygwin1.dll release
Very strange: I ran the crawler once
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
and everything worked until this error:
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070404094549
Indexer: adding segment: crawl/segments/20070404095026
Indexer: adding segment: crawl/segments/20070404095504
Optimizing index.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
I tried running the crawler again
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
and now I consistently get this error:
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
run java in NUTCH_JAVA_HOME D:\java\jdk1.6
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
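Since the console output above only shows the generic "Job failed!" wrapper, I have been digging in Hadoop's own log for the real exception. The logs/hadoop.log path is what my local Nutch 0.8 install uses (an assumption, adjust for your layout):

```shell
# Pull recent exception lines out of Hadoop's log. The logs/hadoop.log
# location is my local install's default; yours may differ.
LOG=logs/hadoop.log
if [ -f "$LOG" ]; then
    grep -n "Exception" "$LOG" | tail -n 20
else
    echo "no $LOG here - run this from the Nutch install directory"
fi
```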
I have one file, localhost, in my urls dir and it looks like this:
http://localhost
My crawl-urlfilter.xml looks like this:
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|swf|sw):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/
# skip everything else
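For what it's worth, here is a small standalone sketch I used to sanity-check the first-match semantics of the filter rules above. It uses plain java.util.regex rather than Nutch's own RegexURLFilter, and only a simplified subset of my rules, so treat it as an approximation:

```java
import java.util.regex.Pattern;

public class FilterCheck {
    // Simplified versions of the rules above: '+' accepts, '-' rejects,
    // the first matching pattern wins, and an unmatched URL is ignored.
    static final String[] RULES = {
        "-^(file|ftp|mailto|swf|sw):",
        "-\\.(gif|GIF|jpg|JPG|ico|ICO|css|zip|png)$",
        "-[?*!@=]",
        "+^http://([a-z0-9]*\\.)*localhost/"
    };

    static boolean accepts(String url) {
        for (String rule : RULES) {
            Pattern p = Pattern.compile(rule.substring(1));
            if (p.matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false; // no pattern matched -> URL is ignored
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://localhost/index.html")); // true
        System.out.println(accepts("http://localhost/logo.png"));   // false (suffix rule)
        System.out.println(accepts("http://example.com/"));         // false (no rule matches)
    }
}
```

One thing this made me notice: the accept rule requires a trailing slash after the host, so exact matching behavior on the bare seed URL may depend on how Nutch normalizes it.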
My nutch-site.xml looks like this
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>RadioCity</value>
<description></description>
</property>
<property>
<name>http.agent.description</name>
<value>nutch web crawler</value>
<description></description>
</property>
<property>
<name>http.agent.url</name>
<value>www.RadioCity.dynip.com/RadioCity/HtmlPages/Nutch</value>
<description></description>
</property>
<property>
<name>http.agent.email</name>
<value>jpsb at flash.net</value>
<description></description>
</property>
</configuration>
I am getting the same behavior on two separate hosts. If anyone can suggest
what I might be doing wrong, I would greatly appreciate it.
jim s
PS: I tried to mail from a different host but did not see the message in the mailing
list. I hope only this message gets into the mailing list.
_______________________________________________
Nutch-general mailing list
nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general