Apparently the job configuration file didn't make it through the
listserv. Here it is in the body of the e-mail.
Jeff
Job Configuration: JobId - job_200711261211_0026
name value
dfs.secondary.info.bindAddress 0.0.0.0
dfs.datanode.port 50010
dfs.client.buffer.dir ${hadoop.tmp.dir}/dfs/tmp
searcher.summary.length 20
generate.update.crawldb false
lang.ngram.max.length 4
tasktracker.http.port 50060
searcher.filter.cache.size 16
ftp.timeout 60000
hadoop.tmp.dir /tmp/hadoop-${user.name}
hadoop.native.lib true
map.sort.class org.apache.hadoop.mapred.MergeSorter
ftp.follow.talk false
indexer.mergeFactor 50
ipc.client.idlethreshold 4000
query.host.boost 2.0
mapred.system.dir /nutch/filesystem/mapreduce/system
ftp.password [EMAIL PROTECTED]
http.agent.version Nutch-1.0-dev
query.tag.boost 1.0
dfs.namenode.logging.level info
db.fetch.schedule.adaptive.sync_delta_rate 0.3
io.skip.checksum.errors false
urlfilter.automaton.file automaton-urlfilter.txt
fs.default.name cisserver:9000
db.ignore.external.links false
extension.ontology.urls
dfs.safemode.threshold.pct 0.999f
dfs.namenode.handler.count 10
plugin.folders plugins
mapred.tasktracker.dns.nameserver default
io.sort.factor 10
fetcher.threads.per.host.by.ip false
parser.html.impl neko
mapred.task.timeout 600000
mapred.max.tracker.failures 4
hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFactory
db.update.additions.allowed true
fs.hdfs.impl org.apache.hadoop.dfs.DistributedFileSystem
indexer.score.power 0.5
ipc.client.maxidletime 120000
db.fetch.schedule.class org.apache.nutch.crawl.DefaultFetchSchedule
mapred.output.key.class org.apache.hadoop.io.Text
file.content.limit 10485760
http.agent.url http://poisk/index.php/Category:Systems
dfs.safemode.extension 30000
tasktracker.http.threads 40
db.fetch.schedule.adaptive.dec_rate 0.2
user.name nutch
mapred.output.compress false
io.bytes.per.checksum 512
fetcher.server.delay 0.2
searcher.summary.context 5
db.fetch.interval.default 2592000
searcher.max.time.tick_count -1
parser.html.form.use_action false
fs.trash.root ${hadoop.tmp.dir}/Trash
mapred.reduce.max.attempts 4
fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem
db.score.count.filtered false
fetcher.max.crawl.delay 30
dfs.info.port 50070
indexer.maxMergeDocs 2147483647
mapred.jar /nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
fs.s3.buffer.dir ${hadoop.tmp.dir}/s3
dfs.block.size 67108864
http.robots.403.allow true
ftp.content.limit 10485760
job.end.retry.attempts 0
fs.file.impl org.apache.hadoop.fs.LocalFileSystem
query.title.boost 1.5
mapred.speculative.execution true
mapred.local.dir.minspacestart 0
mapred.output.compression.type RECORD
mime.types.file tika-mimetypes.xml
generate.max.per.host.by.ip false
fetcher.parse false
db.default.fetch.interval 30
db.max.outlinks.per.page -1
analysis.common.terms.file common-terms.utf8
mapred.userlog.retain.hours 24
dfs.replication.max 512
http.redirect.max 5
local.cache.size 10737418240
mapred.min.split.size 0
mapred.map.tasks 18
fetcher.threads.fetch 10
mapred.child.java.opts -Xmx1500m
mapred.output.value.class org.apache.nutch.parse.ParseImpl
http.timeout 10000
http.content.limit 10485760
dfs.secondary.info.port 50090
ipc.server.listen.queue.size 128
encodingdetector.charset.min.confidence -1
mapred.inmem.merge.threshold 1000
job.end.retry.interval 30000
fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary
query.url.boost 4.0
mapred.reduce.tasks 6
db.score.link.external 1.0
query.anchor.boost 2.0
mapred.userlog.limit.kb 0
webinterface.private.actions false
db.max.inlinks 10000000
mapred.job.split.file /nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
mapred.job.name parse crawl20071126/segments/20071126123442
dfs.datanode.dns.nameserver default
dfs.blockreport.intervalMsec 3600000
ftp.username anonymous
db.fetch.schedule.adaptive.inc_rate 0.4
searcher.max.hits -1
mapred.map.max.attempts 4
urlnormalizer.regex.file regex-normalize.xml
ftp.keep.connection false
searcher.filter.cache.threshold 0.05
mapred.job.tracker.handler.count 10
dfs.client.block.write.retries 3
mapred.input.format.class org.apache.hadoop.mapred.SequenceFileInputFormat
http.verbose true
fetcher.threads.per.host 8
mapred.tasktracker.expiry.interval 600000
mapred.job.tracker.info.bindAddress 0.0.0.0
ipc.client.timeout 60000
keep.failed.task.files false
mapred.output.format.class org.apache.nutch.parse.ParseOutputFormat
mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
io.map.index.skip 0
mapred.working.dir /user/nutch
tasktracker.http.bindAddress 0.0.0.0
io.seqfile.compression.type RECORD
mapred.reducer.class org.apache.nutch.parse.ParseSegment
lang.analyze.max.length 2048
db.fetch.schedule.adaptive.min_interval 60.0
http.agent.name Jeffcrawler
dfs.default.chunk.view.size 32768
hadoop.logfile.size 10000000
dfs.datanode.du.pct 0.98f
parser.caching.forbidden.policy content
http.useHttp11 false
fs.inmemory.size.mb 75
db.fetch.schedule.adaptive.sync_delta true
dfs.datanode.du.reserved 0
mapred.job.tracker.info.port 50030
plugin.auto-activation true
fs.checkpoint.period 3600
mapred.jobtracker.completeuserjobs.maximum 100
mapred.task.tracker.report.bindAddress 127.0.0.1
db.signature.text_profile.min_token_len 2
query.phrase.boost 1.0
lang.ngram.min.length 1
dfs.df.interval 60000
dfs.data.dir /nutch/filesystem/data
dfs.datanode.bindAddress 0.0.0.0
fs.s3.maxRetries 4
dfs.datanode.dns.interface default
http.agent.email Jeff
extension.clustering.hits-to-cluster 100
searcher.max.time.tick_length 200
http.agent.description Jeff's Crawler
query.lang.boost 0.0
mapred.local.dir /nutch/filesystem/mapreduce/local
fs.hftp.impl org.apache.hadoop.dfs.HftpFileSystem
mapred.mapper.class org.apache.nutch.parse.ParseSegment
fs.trash.interval 0
fs.s3.sleepTimeSeconds 10
dfs.replication.min 1
mapred.submit.replication 10
indexer.max.title.length 100
parser.character.encoding.default windows-1252
mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
mapred.tasktracker.dns.interface default
http.robots.agents Jeffcrawler,*
mapred.job.tracker cisserver:9001
dfs.heartbeat.interval 3
urlfilter.regex.file crawl-urlfilter.txt
io.seqfile.sorter.recordlimit 1000000
fetcher.store.content true
urlfilter.suffix.file suffix-urlfilter.txt
dfs.name.dir /nutch/filesystem/name
fetcher.verbose true
db.signature.class org.apache.nutch.crawl.MD5Signature
db.max.anchor.length 100
parse.plugin.file parse-plugins.xml
nutch.segment.name 20071126123442
mapred.local.dir.minspacekill 0
searcher.dir /var/nutch/crawl
fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem
plugin.includes protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfilter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|swf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-opic|subcollection
mapred.map.output.compression.type RECORD
mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp
db.fetch.retry.max 3
query.cc.boost 0.0
dfs.replication 2
db.ignore.internal.links false
dfs.info.bindAddress 0.0.0.0
query.site.boost 0.0
searcher.hostgrouping.rawhits.factor 2.0
fetcher.server.min.delay 0.0
hadoop.logfile.count 10
indexer.termIndexInterval 128
file.content.ignored true
db.score.link.internal 1.0
io.seqfile.compress.blocksize 1000000
fs.s3.block.size 67108864
ftp.server.timeout 100000
http.max.delays 1000
indexer.minMergeDocs 50
mapred.reduce.parallel.copies 5
io.seqfile.lazydecompress true
mapred.output.dir /user/nutch/crawl20071126/segments/20071126123442
indexer.max.tokens 10000000
io.sort.mb 100
ipc.client.connection.maxidletime 1000
db.fetch.schedule.adaptive.max_interval 31536000.0
mapred.compress.map.output false
ipc.client.kill.max 10
urlnormalizer.order org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
ipc.client.connect.max.retries 10
urlfilter.prefix.file prefix-urlfilter.txt
db.signature.text_profile.quant_rate 0.01
query.type.boost 0.0
fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
mime.type.magic true
generate.max.per.host -1
db.fetch.interval.max 7776000
urlnormalizer.loop.count 1
mapred.input.dir /user/nutch/crawl20071126/segments/20071126123442/content
io.file.buffer.size 4096
db.score.injected 1.0
dfs.replication.considerLoad true
jobclient.output.filter FAILED
mapred.tasktracker.tasks.maximum 2
io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec
fs.checkpoint.size 67108864
________________________________
From: Bolle, Jeffrey F. [mailto:[EMAIL PROTECTED]
Sent: Monday, November 26, 2007 3:08 PM
To: [email protected]
Subject: Crash in Parser
All,
I'm having some trouble with the Nutch nightly. It has been a
while since I last updated my crawl of our intranet. I was attempting
to run the crawl today and it failed with this:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
The web interface reports:
Task task_200711261211_0026_m_000015_0 failed to report status for 602 seconds. Killing!
Task task_200711261211_0026_m_000015_1 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_2 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_3 failed to report status for 602 seconds. Killing!
I don't have the fetchers set to parse. Nutch and Hadoop are
running on a three-node cluster. I've attached the job configuration file
as saved from the web interface.
Is there any way I can get more information about which file or
URL the parse is failing on? Why doesn't the parsing of a file or URL
fail more cleanly?
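One thing I'm planning to try, assuming the standard bin/nutch
wrappers and the usual Hadoop per-task log layout (both assumptions on my
part), is re-running just the parse step against that segment and then
digging through the logs of the killed attempt, roughly like this:

    # Re-run only the parse step on the segment named in the job
    # configuration, so the failure can be reproduced without repeating
    # the whole crawl.
    bin/nutch parse crawl20071126/segments/20071126123442

    # On the tasktracker node that ran the killed attempt, its
    # stdout/stderr/syslog may show the last document being handled
    # before the 600-second kill.
    ls $HADOOP_LOG_DIR/userlogs/task_200711261211_0026_m_000015_0/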
Any recommendations on helping Nutch avoid whatever is causing
the hang so it can index the rest of the content?
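As a blunt workaround I could raise mapred.task.timeout (currently
600000 ms in the configuration above) in hadoop-site.xml, along these
lines, with 1800000 ms being an arbitrary example value, but that would
only paper over the hang rather than explain it:

    <!-- Sketch only: give slow parse tasks more time before the
         tasktracker kills them; 1800000 ms (30 minutes) is just an
         example value. -->
    <property>
      <name>mapred.task.timeout</name>
      <value>1800000</value>
    </property>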
Thanks.
Jeff Bolle