This is a problem with regex parsing. It has happened to me in
urlnormalizer, where a URL was parsed incorrectly and for some reason
was extremely long or contained control characters. What happens is
that if the URL is really long (say, thousands of characters), the
matcher falls into a very inefficient algorithm (I believe O(n^3), but
I'm not sure) while looking for certain features. I fixed this by
having prefix-urlnormalizer first check whether the length of the URL
is below some constant (which I have defined as 1024).
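A minimal sketch of that kind of length guard (the class and method
names below are hypothetical, not the actual Nutch code):

    // Hypothetical guard: reject oversized URLs before any regex runs.
    public class LengthGuardedNormalizer {
        private static final int MAX_URL_LENGTH = 1024; // cutoff mentioned above

        public String normalize(String url) {
            // Pathologically long URLs drive the regex engine into heavy
            // backtracking, so refuse them up front.
            if (url == null || url.length() >= MAX_URL_LENGTH) {
                return null; // treat as invalid instead of normalizing
            }
            return applyRegexRules(url);
        }

        private String applyRegexRules(String url) {
            // ... the existing regex normalization rules would run here ...
            return url;
        }
    }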
I also saw this problem happen the other day with the .js parser. There
was a page, http://www.magic-cadeaux.fr/, that had a JavaScript line of
150,000 slashes in a row. It parses fine in a browser, but again it
sent the regex engine into an endless loop.
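If you're curious why a long run of a single character can do that,
here is a small self-contained demonstration. The pattern below is
deliberately pathological and hypothetical (it is not the pattern
parse-js actually uses), but it shows how nested quantifiers make
matching time explode with input length:

    import java.util.regex.Pattern;

    public class BacktrackDemo {
        public static void main(String[] args) {
            // Nested quantifiers: the engine tries every way of splitting
            // the run of slashes between the inner and outer loop before
            // it can report failure.
            Pattern p = Pattern.compile("(/+)+x");
            for (int n = 10; n <= 26; n += 4) {
                StringBuilder slashes = new StringBuilder();
                for (int i = 0; i < n; i++) {
                    slashes.append('/');
                }
                long start = System.currentTimeMillis();
                p.matcher(slashes).matches(); // never matches: no trailing 'x'
                System.out.println(n + " slashes: "
                        + (System.currentTimeMillis() - start) + " ms");
            }
            // Time roughly doubles with each added slash, so a line of
            // 150,000 slashes effectively never finishes.
        }
    }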
If you find these are the problem, you can locate the stuck task and do
a kill -SIGQUIT on it, which will dump stack traces to stdout
(redirected to logs/userlogs/[task name]/stdout). Then check whether
it's stuck in a regex loop and what put it there.
--Ned
Bolle, Jeffrey F. wrote:
Apparently the job configuration file didn't make it through the
listserv. Here it is in the body of the e-mail.
Jeff
Job Configuration: JobId - job_200711261211_0026
name value
dfs.secondary.info.bindAddress 0.0.0.0
dfs.datanode.port 50010
dfs.client.buffer.dir ${hadoop.tmp.dir}/dfs/tmp
searcher.summary.length 20
generate.update.crawldb false
lang.ngram.max.length 4
tasktracker.http.port 50060
searcher.filter.cache.size 16
ftp.timeout 60000
hadoop.tmp.dir /tmp/hadoop-${user.name}
hadoop.native.lib true
map.sort.class org.apache.hadoop.mapred.MergeSorter
ftp.follow.talk false
indexer.mergeFactor 50
ipc.client.idlethreshold 4000
query.host.boost 2.0
mapred.system.dir /nutch/filesystem/mapreduce/system
ftp.password [EMAIL PROTECTED]
http.agent.version Nutch-1.0-dev
query.tag.boost 1.0
dfs.namenode.logging.level info
db.fetch.schedule.adaptive.sync_delta_rate 0.3
io.skip.checksum.errors false
urlfilter.automaton.file automaton-urlfilter.txt
fs.default.name cisserver:9000
db.ignore.external.links false
extension.ontology.urls
dfs.safemode.threshold.pct 0.999f
dfs.namenode.handler.count 10
plugin.folders plugins
mapred.tasktracker.dns.nameserver default
io.sort.factor 10
fetcher.threads.per.host.by.ip false
parser.html.impl neko
mapred.task.timeout 600000
mapred.max.tracker.failures 4
hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFactory
db.update.additions.allowed true
fs.hdfs.impl org.apache.hadoop.dfs.DistributedFileSystem
indexer.score.power 0.5
ipc.client.maxidletime 120000
db.fetch.schedule.class org.apache.nutch.crawl.DefaultFetchSchedule
mapred.output.key.class org.apache.hadoop.io.Text
file.content.limit 10485760
http.agent.url http://poisk/index.php/Category:Systems
dfs.safemode.extension 30000
tasktracker.http.threads 40
db.fetch.schedule.adaptive.dec_rate 0.2
user.name nutch
mapred.output.compress false
io.bytes.per.checksum 512
fetcher.server.delay 0.2
searcher.summary.context 5
db.fetch.interval.default 2592000
searcher.max.time.tick_count -1
parser.html.form.use_action false
fs.trash.root ${hadoop.tmp.dir}/Trash
mapred.reduce.max.attempts 4
fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem
db.score.count.filtered false
fetcher.max.crawl.delay 30
dfs.info.port 50070
indexer.maxMergeDocs 2147483647
mapred.jar /nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
fs.s3.buffer.dir ${hadoop.tmp.dir}/s3
dfs.block.size 67108864
http.robots.403.allow true
ftp.content.limit 10485760
job.end.retry.attempts 0
fs.file.impl org.apache.hadoop.fs.LocalFileSystem
query.title.boost 1.5
mapred.speculative.execution true
mapred.local.dir.minspacestart 0
mapred.output.compression.type RECORD
mime.types.file tika-mimetypes.xml
generate.max.per.host.by.ip false
fetcher.parse false
db.default.fetch.interval 30
db.max.outlinks.per.page -1
analysis.common.terms.file common-terms.utf8
mapred.userlog.retain.hours 24
dfs.replication.max 512
http.redirect.max 5
local.cache.size 10737418240
mapred.min.split.size 0
mapred.map.tasks 18
fetcher.threads.fetch 10
mapred.child.java.opts -Xmx1500m
mapred.output.value.class org.apache.nutch.parse.ParseImpl
http.timeout 10000
http.content.limit 10485760
dfs.secondary.info.port 50090
ipc.server.listen.queue.size 128
encodingdetector.charset.min.confidence -1
mapred.inmem.merge.threshold 1000
job.end.retry.interval 30000
fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary
query.url.boost 4.0
mapred.reduce.tasks 6
db.score.link.external 1.0
query.anchor.boost 2.0
mapred.userlog.limit.kb 0
webinterface.private.actions false
db.max.inlinks 10000000
mapred.job.split.file /nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
mapred.job.name parse crawl20071126/segments/20071126123442
dfs.datanode.dns.nameserver default
dfs.blockreport.intervalMsec 3600000
ftp.username anonymous
db.fetch.schedule.adaptive.inc_rate 0.4
searcher.max.hits -1
mapred.map.max.attempts 4
urlnormalizer.regex.file regex-normalize.xml
ftp.keep.connection false
searcher.filter.cache.threshold 0.05
mapred.job.tracker.handler.count 10
dfs.client.block.write.retries 3
mapred.input.format.class org.apache.hadoop.mapred.SequenceFileInputFormat
http.verbose true
fetcher.threads.per.host 8
mapred.tasktracker.expiry.interval 600000
mapred.job.tracker.info.bindAddress 0.0.0.0
ipc.client.timeout 60000
keep.failed.task.files false
mapred.output.format.class org.apache.nutch.parse.ParseOutputFormat
mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
io.map.index.skip 0
mapred.working.dir /user/nutch
tasktracker.http.bindAddress 0.0.0.0
io.seqfile.compression.type RECORD
mapred.reducer.class org.apache.nutch.parse.ParseSegment
lang.analyze.max.length 2048
db.fetch.schedule.adaptive.min_interval 60.0
http.agent.name Jeffcrawler
dfs.default.chunk.view.size 32768
hadoop.logfile.size 10000000
dfs.datanode.du.pct 0.98f
parser.caching.forbidden.policy content
http.useHttp11 false
fs.inmemory.size.mb 75
db.fetch.schedule.adaptive.sync_delta true
dfs.datanode.du.reserved 0
mapred.job.tracker.info.port 50030
plugin.auto-activation true
fs.checkpoint.period 3600
mapred.jobtracker.completeuserjobs.maximum 100
mapred.task.tracker.report.bindAddress 127.0.0.1
db.signature.text_profile.min_token_len 2
query.phrase.boost 1.0
lang.ngram.min.length 1
dfs.df.interval 60000
dfs.data.dir /nutch/filesystem/data
dfs.datanode.bindAddress 0.0.0.0
fs.s3.maxRetries 4
dfs.datanode.dns.interface default
http.agent.email Jeff
extension.clustering.hits-to-cluster 100
searcher.max.time.tick_length 200
http.agent.description Jeff's Crawler
query.lang.boost 0.0
mapred.local.dir /nutch/filesystem/mapreduce/local
fs.hftp.impl org.apache.hadoop.dfs.HftpFileSystem
mapred.mapper.class org.apache.nutch.parse.ParseSegment
fs.trash.interval 0
fs.s3.sleepTimeSeconds 10
dfs.replication.min 1
mapred.submit.replication 10
indexer.max.title.length 100
parser.character.encoding.default windows-1252
mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
mapred.tasktracker.dns.interface default
http.robots.agents Jeffcrawler,*
mapred.job.tracker cisserver:9001
dfs.heartbeat.interval 3
urlfilter.regex.file crawl-urlfilter.txt
io.seqfile.sorter.recordlimit 1000000
fetcher.store.content true
urlfilter.suffix.file suffix-urlfilter.txt
dfs.name.dir /nutch/filesystem/name
fetcher.verbose true
db.signature.class org.apache.nutch.crawl.MD5Signature
db.max.anchor.length 100
parse.plugin.file parse-plugins.xml
nutch.segment.name 20071126123442
mapred.local.dir.minspacekill 0
searcher.dir /var/nutch/crawl
fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem
plugin.includes protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfilter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|swf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-opic|subcollection
mapred.map.output.compression.type RECORD
mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp
db.fetch.retry.max 3
query.cc.boost 0.0
dfs.replication 2
db.ignore.internal.links false
dfs.info.bindAddress 0.0.0.0
query.site.boost 0.0
searcher.hostgrouping.rawhits.factor 2.0
fetcher.server.min.delay 0.0
hadoop.logfile.count 10
indexer.termIndexInterval 128
file.content.ignored true
db.score.link.internal 1.0
io.seqfile.compress.blocksize 1000000
fs.s3.block.size 67108864
ftp.server.timeout 100000
http.max.delays 1000
indexer.minMergeDocs 50
mapred.reduce.parallel.copies 5
io.seqfile.lazydecompress true
mapred.output.dir /user/nutch/crawl20071126/segments/20071126123442
indexer.max.tokens 10000000
io.sort.mb 100
ipc.client.connection.maxidletime 1000
db.fetch.schedule.adaptive.max_interval 31536000.0
mapred.compress.map.output false
ipc.client.kill.max 10
urlnormalizer.order org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
ipc.client.connect.max.retries 10
urlfilter.prefix.file prefix-urlfilter.txt
db.signature.text_profile.quant_rate 0.01
query.type.boost 0.0
fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
mime.type.magic true
generate.max.per.host -1
db.fetch.interval.max 7776000
urlnormalizer.loop.count 1
mapred.input.dir /user/nutch/crawl20071126/segments/20071126123442/content
io.file.buffer.size 4096
db.score.injected 1.0
dfs.replication.considerLoad true
jobclient.output.filter FAILED
mapred.tasktracker.tasks.maximum 2
io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec
fs.checkpoint.size 67108864
________________________________
From: Bolle, Jeffrey F. [mailto:[EMAIL PROTECTED]
Sent: Monday, November 26, 2007 3:08 PM
To: [email protected]
Subject: Crash in Parser
All,
I'm having some trouble with the Nutch nightly. It has been a
while since I last updated my crawl of our intranet. I was attempting
to run the crawl today and it failed with this:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
In the web interface it says:
Task task_200711261211_0026_m_000015_0 failed to report status for 602 seconds. Killing!
Task task_200711261211_0026_m_000015_1 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_2 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_3 failed to report status for 602 seconds. Killing!
I don't have the fetchers set to parse. Nutch and Hadoop are
running on a three-node cluster. I've attached the job configuration
file as saved from the web interface.
Is there any way I can get more information on which file or
URL the parse is failing on? Why doesn't the parsing of a file or URL
fail more cleanly?
Any recommendations on helping Nutch avoid whatever is causing
the hang so it can index the rest of the content?
Thanks.
Jeff Bolle