Struth! Here's another problem as well. I'm trying to merge the segments I've 
created so far:

$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141605
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141522
SegmentMerger:   adding file:/tmp/crawl/segments/20080905142231
SegmentMerger:   adding file:/tmp/crawl/segments/20080905153116
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate

$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000

But when I run invertlinks, I get an error about a missing path:

$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
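
The merged segment above contains only crawl_generate, and invertlinks is asking for parse_data. The "using segment data from: crawl_generate" line suggests the merger kept only the part it found in every input segment, so one unfetched segment may be enough to lose the rest. A rough sketch for spotting such segments before merging follows; the function name is made up for illustration, and it assumes the usual segment layout (crawl_generate, crawl_fetch, content, parse_data and so on):

```shell
# List segments under a directory that lack a given part (default: parse_data).
# segments_missing is an illustrative helper, not a Nutch command.
segments_missing() {
  segdir=$1
  part=${2:-parse_data}
  for seg in "$segdir"/*/; do
    # Skip the unexpanded glob when the directory has no subdirectories.
    [ -d "$seg" ] || continue
    # $seg ends in "/", so "$seg$part" is <segment>/<part>.
    [ -d "$seg$part" ] || echo "${seg%/}"
  done
}
```

Running `segments_missing crawl/BACKUPsegments` would print any segment that was generated but never parsed. Leaving those out of the mergesegs input might be enough for the merged segment to keep parse_data.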



From: [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000








Sort of figured out how to kickstart the crawl again.

Basically did:

# newest existing segment
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
# segment just created by generate
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"

But unfortunately this is fetching the same urls as the previous fetch. :(
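
One possible reason for the repeats: generate picks its URLs from the crawldb, and as far as I understand it, only an updatedb run on a segment that has actually been fetched marks them as done. Below is a rough sketch of one round in that order. NUTCH, latest_segment and crawl_round are names I made up, and the paths assume the crawl/ layout used above:

```shell
# One generate/fetch/updatedb round. NUTCH, latest_segment and crawl_round
# are illustrative names, not part of Nutch.
NUTCH=${NUTCH:-bin/nutch}

latest_segment() {
  # Segment names are timestamps, so the last one in sorted order is the newest.
  ls -d "$1"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/$::'
}

crawl_round() {
  crawl=$1
  "$NUTCH" generate "$crawl/crawldb" "$crawl/segments" || return 1
  seg=$(latest_segment "$crawl/segments")
  [ -n "$seg" ] || return 1
  "$NUTCH" fetch "$seg" || return 1
  # Feed the fetch results back so the next generate can skip these URLs.
  "$NUTCH" updatedb "$crawl/crawldb" "$seg"
}
```

Calling `crawl_round crawl` repeatedly would then approximate what the one-shot crawl command does per depth level, under the assumptions above.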

From: [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000








Initially I just did a tail -10 so thought there were no errors, but there are 
a few actually. The pdf errors are my fault because I updated the pdf plugin 
with the latest PDFBox and FontBox jars from cvs on sf.net and missed out 
parse-pdf.jar on the rebuild. I'm not sure that's the reason why the job failed 
though. The log is 5MB so I can't really attach it all here, but hopefully the 
last 200 lines give an indication.
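
Since a tail -10 hid the errors the first time, here is a small sketch for summarising a log of this size instead of eyeballing it. It assumes the line layout seen in hadoop.log (date, time, level, logger, then the message); log_summary is just an illustrative name:

```shell
# Count WARN/ERROR/FATAL lines per logger in a hadoop.log-style file.
# Assumes fields: date, time, level, logger, "-", message.
log_summary() {
  awk '$3 == "WARN" || $3 == "ERROR" || $3 == "FATAL" { n[$3 " " $4]++ }
       END { for (k in n) print n[k], k }' "$1" | sort -rn
}
```

Something like `log_summary hadoop.log.2008-09-05` prints one line per level and logger, busiest first. The loggers with only a handful of entries are usually the ones worth reading in full, since the repeated parser warnings drown everything else out.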

By the way, is there a way to kickstart this crawl off again without crawling 
from the start again?


tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - 
ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf 
instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN  parse.ParseUtil - Unable to successfully parse 
content 
http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf
 of type application/pdf
2008-09-05 03:41:22,362 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN  parse.ParserFactory - 
org.apache.nutch.plugin.PluginRuntimeException: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - Caused by: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - 
ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf 
instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN  parse.ParseUtil - Unable to successfully parse 
content 
http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
 of type application/pdf
2008-09-05 03:41:27,216 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - 
org.apache.nutch.plugin.PluginRuntimeException: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - Caused by: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - 
ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf 
instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN  parse.ParseUtil - Unable to successfully parse 
content 
http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf
 of type application/pdf
2008-09-05 03:41:34,564 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - 
org.apache.nutch.plugin.PluginRuntimeException: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - Caused by: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - 
ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf 
instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN  parse.ParseUtil - Unable to successfully parse 
content 
http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf
 of type application/pdf
2008-09-05 03:41:35,928 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN  parse.ParserFactory - 
org.apache.nutch.plugin.PluginRuntimeException: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - Caused by: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - 
ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf 
instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN  parse.ParseUtil - Unable to successfully parse 
content 
http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
 of type application/pdf
2008-09-05 03:41:42,313 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO  fetcher.Fetcher - fetching 
http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - 
org.apache.nutch.plugin.PluginRuntimeException: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - Caused by: 
java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - 
ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf 
instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN  parse.ParseUtil - Unable to successfully parse 
content 
http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
 of type application/pdf
2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid 
local directory for 
taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
        at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at 
org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO  searcher.NutchBean - opening indexes in 
crawl/indexes
2008-09-05 09:32:47,002 INFO  plugin.PluginRepository - Plugins: looking in: 
/ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Plugin Auto-activation 
mode: [true]
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         CyberNeko HTML 
Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSPowerPoint 
Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Site Query 
Filter (query-site)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Http / Https 
Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSWord Parse 
Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Basic URL 
Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pass-through 
URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Html Parse 
Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL 
Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Indexing 
Filter (index-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pdf Parse 
Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic 
Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         MSExcel Parse 
Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Text Parse 
Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Jakarta POI - 
Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL 
Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Query 
Filter (query-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         HTTP Framework 
(lib-http)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         URL Query 
Filter (query-url)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL 
Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Parse MS 
Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Zip Parse 
Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         the nutch core 
extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         OPIC Scoring 
Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository - Registered 
Extension-Points:
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Nutch 
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Scoring 
(org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Protocol 
(org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL 
Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL 
Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         HTML Parse 
Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Online 
Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Indexing 
Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Content 
Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Ontology Model 
Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Analysis 
(org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Query 
Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO  searcher.NutchBean - opening segments in 
crawl/segments
2008-09-05 09:32:47,368 INFO  searcher.SummarizerFactory - Using the first 
summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO  searcher.NutchBean - opening linkdb in 
crawl/linkdb
2008-09-05 09:32:52,746 INFO  searcher.NutchBean - opening indexes in 
crawl/indexes
2008-09-05 09:32:52,791 INFO  plugin.PluginRepository - Plugins: looking in: 
/ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Plugin Auto-activation 
mode: [true]
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository -         CyberNeko HTML 
Parser (lib-nekohtml)
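
The one entry in that tail that is not a PDF parser warning is the DiskErrorException from mapred.LocalJobRunner ("Could not find any valid local directory"). As far as I know, that usually means Hadoop could not find a writable local directory with enough free space for map output. Here is a minimal sketch for checking that, assuming hadoop.tmp.dir was left at its default of /tmp/hadoop-<user>; check_local_dir is a made-up name:

```shell
# Report free space (KB) and writability for a directory Hadoop may use for
# local job files. The default path assumes hadoop.tmp.dir was not overridden.
check_local_dir() {
  dir=${1:-/tmp/hadoop-$(id -un)}
  [ -d "$dir" ] || { echo "$dir: missing"; return 1; }
  # -P forces one line per filesystem, so the figures are on line 2.
  free=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  if [ -w "$dir" ]; then w=writable; else w=read-only; fi
  echo "$dir: ${free} KB free, $w"
}
```

Running `check_local_dir` with no argument looks at the assumed default. If the config overrides hadoop.tmp.dir or mapred.local.dir, pass that path instead. /tmp is often a separate, smaller filesystem than the one holding the crawl data, so free space elsewhere on the box does not rule this out.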



> Subject: Re: Job failed!
> From: [EMAIL PROTECTED]
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
> 
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> > 
> > I ran a crawl last night 
> > 
> > bin/nutch crawl urls -dir crawl -depth 10
> > 
> > which collected 10612 pages, and then bailed out with the following error:
> > 
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> > 
> > I checked there was enough space on the box, and there don't appear to be 
> > any errors in hadoop.log or the crawl output, so I'm stuck on what caused 
> > this.
> > 
> > Also, is there a way to pick up the crawl from where it stopped rather than 
> > having to rerun it all over again?
> > 
> > Thanks for any help.
> > 
> > Ed.
> > 
> > 
> > 
> 
> 
