This is a problem with regex parsing. I have seen it happen in the urlnormalizer when a URL was parsed incorrectly and for some reason ended up extremely long or full of control characters. If the URL is really long (say thousands of characters), the regex matching falls into a very inefficient algorithm (I believe O(n^3), but I'm not sure) while trying to match certain features. I fixed this by having prefix-urlnormalizer check first whether the length of the URL is less than some constant (I have it defined as 1024). I also saw this problem the other day with the .js parser: there was a page, http://www.magic-cadeaux.fr/, with a javascript line that was 150000 slashes in a row. It parses fine in a browser, but again it led to an endless regex loop.
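
For reference, here is a minimal sketch of what that kind of guard can look like. The class name, the MAX_URL_LENGTH constant, and the applyRegexRules() placeholder are mine for illustration, not the exact patch; the point is just to refuse pathological input before the regex engine can start backtracking on it.

// Illustrative sketch only, not the actual Nutch change.
public class LengthGuardedNormalizer {

  // URLs longer than this are almost certainly garbage and are dropped
  // rather than fed to the regex rules.
  private static final int MAX_URL_LENGTH = 1024;

  public String normalize(String url) {
    if (url == null || url.length() > MAX_URL_LENGTH) {
      // Returning null signals the caller to discard the URL instead of
      // spending minutes (or forever) inside a backtracking match.
      return null;
    }
    return applyRegexRules(url);
  }

  private String applyRegexRules(String url) {
    // Stand-in for the real regex-normalize.xml driven substitutions.
    return url;
  }
}

The same idea applies to the .js parser case: bail out before handing a pathological string to the regex engine at all.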

If you think this is your problem, you can find the stuck task and send it a kill -SIGQUIT, which dumps stack traces to stdout (redirected to logs/userlogs/[task name]/stdout). Then check whether it's stuck in a regex loop and what put it there.

--Ned

Bolle, Jeffrey F. wrote:
Apparently the job configuration file didn't make it through the
listserv.  Here it is in the body of the e-mail.
Jeff
Job Configuration: JobId - job_200711261211_0026


name     value  
dfs.secondary.info.bindAddress   0.0.0.0        
dfs.datanode.port        50010  
dfs.client.buffer.dir    ${hadoop.tmp.dir}/dfs/tmp      
searcher.summary.length  20     
generate.update.crawldb  false  
lang.ngram.max.length    4      
tasktracker.http.port    50060  
searcher.filter.cache.size       16     
ftp.timeout      60000  
hadoop.tmp.dir   /tmp/hadoop-${user.name}       
hadoop.native.lib        true   
map.sort.class   org.apache.hadoop.mapred.MergeSorter   
ftp.follow.talk  false  
indexer.mergeFactor      50     
ipc.client.idlethreshold         4000   
query.host.boost         2.0    
mapred.system.dir        /nutch/filesystem/mapreduce/system     
ftp.password     [EMAIL PROTECTED]      
http.agent.version       Nutch-1.0-dev  
query.tag.boost  1.0    
dfs.namenode.logging.level       info   
db.fetch.schedule.adaptive.sync_delta_rate       0.3    
io.skip.checksum.errors  false  
urlfilter.automaton.file         automaton-urlfilter.txt        
fs.default.name  cisserver:9000 
db.ignore.external.links         false  
extension.ontology.urls         
dfs.safemode.threshold.pct       0.999f 
dfs.namenode.handler.count       10     
plugin.folders   plugins        
mapred.tasktracker.dns.nameserver        default        
io.sort.factor   10     
fetcher.threads.per.host.by.ip   false  
parser.html.impl         neko   
mapred.task.timeout      600000 
mapred.max.tracker.failures      4      
hadoop.rpc.socket.factory.class.default  org.apache.hadoop.net.StandardSocketFactory
db.update.additions.allowed      true   
fs.hdfs.impl     org.apache.hadoop.dfs.DistributedFileSystem    
indexer.score.power      0.5    
ipc.client.maxidletime   120000 
db.fetch.schedule.class  org.apache.nutch.crawl.DefaultFetchSchedule
mapred.output.key.class  org.apache.hadoop.io.Text      
file.content.limit       10485760       
http.agent.url   http://poisk/index.php/Category:Systems        
dfs.safemode.extension   30000  
tasktracker.http.threads         40     
db.fetch.schedule.adaptive.dec_rate      0.2    
user.name        nutch  
mapred.output.compress   false  
io.bytes.per.checksum    512    
fetcher.server.delay     0.2    
searcher.summary.context         5      
db.fetch.interval.default        2592000        
searcher.max.time.tick_count     -1     
parser.html.form.use_action      false  
fs.trash.root    ${hadoop.tmp.dir}/Trash        
mapred.reduce.max.attempts       4      
fs.ramfs.impl    org.apache.hadoop.fs.InMemoryFileSystem        
db.score.count.filtered  false  
fetcher.max.crawl.delay  30     
dfs.info.port    50070  
indexer.maxMergeDocs     2147483647     
mapred.jar       /nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
fs.s3.buffer.dir         ${hadoop.tmp.dir}/s3   
dfs.block.size   67108864       
http.robots.403.allow    true   
ftp.content.limit        10485760       
job.end.retry.attempts   0      
fs.file.impl     org.apache.hadoop.fs.LocalFileSystem   
query.title.boost        1.5    
mapred.speculative.execution     true   
mapred.local.dir.minspacestart   0      
mapred.output.compression.type   RECORD 
mime.types.file  tika-mimetypes.xml     
generate.max.per.host.by.ip      false  
fetcher.parse    false  
db.default.fetch.interval        30     
db.max.outlinks.per.page         -1     
analysis.common.terms.file       common-terms.utf8      
mapred.userlog.retain.hours      24     
dfs.replication.max      512    
http.redirect.max        5      
local.cache.size         10737418240    
mapred.min.split.size    0      
mapred.map.tasks         18     
fetcher.threads.fetch    10     
mapred.child.java.opts   -Xmx1500m      
mapred.output.value.class        org.apache.nutch.parse.ParseImpl
http.timeout     10000  
http.content.limit       10485760       
dfs.secondary.info.port  50090  
ipc.server.listen.queue.size     128    
encodingdetector.charset.min.confidence  -1     
mapred.inmem.merge.threshold     1000   
job.end.retry.interval   30000  
fs.checkpoint.dir        ${hadoop.tmp.dir}/dfs/namesecondary    
query.url.boost  4.0    
mapred.reduce.tasks      6      
db.score.link.external   1.0    
query.anchor.boost       2.0    
mapred.userlog.limit.kb  0      
webinterface.private.actions     false  
db.max.inlinks   10000000       
mapred.job.split.file    /nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
mapred.job.name  parse crawl20071126/segments/20071126123442    
dfs.datanode.dns.nameserver      default        
dfs.blockreport.intervalMsec     3600000        
ftp.username     anonymous      
db.fetch.schedule.adaptive.inc_rate      0.4    
searcher.max.hits        -1     
mapred.map.max.attempts  4      
urlnormalizer.regex.file         regex-normalize.xml    
ftp.keep.connection      false  
searcher.filter.cache.threshold  0.05   
mapred.job.tracker.handler.count         10     
dfs.client.block.write.retries   3      
mapred.input.format.class        org.apache.hadoop.mapred.SequenceFileInputFormat
http.verbose     true   
fetcher.threads.per.host         8      
mapred.tasktracker.expiry.interval       600000 
mapred.job.tracker.info.bindAddress      0.0.0.0        
ipc.client.timeout       60000  
keep.failed.task.files   false  
mapred.output.format.class       org.apache.nutch.parse.ParseOutputFormat
mapred.output.compression.codec  org.apache.hadoop.io.compress.DefaultCodec
io.map.index.skip        0      
mapred.working.dir       /user/nutch    
tasktracker.http.bindAddress     0.0.0.0        
io.seqfile.compression.type      RECORD 
mapred.reducer.class     org.apache.nutch.parse.ParseSegment    
lang.analyze.max.length  2048   
db.fetch.schedule.adaptive.min_interval  60.0   
http.agent.name  Jeffcrawler    
dfs.default.chunk.view.size      32768  
hadoop.logfile.size      10000000       
dfs.datanode.du.pct      0.98f  
parser.caching.forbidden.policy  content        
http.useHttp11   false  
fs.inmemory.size.mb      75     
db.fetch.schedule.adaptive.sync_delta    true   
dfs.datanode.du.reserved         0      
mapred.job.tracker.info.port     50030  
plugin.auto-activation   true   
fs.checkpoint.period     3600   
mapred.jobtracker.completeuserjobs.maximum       100    
mapred.task.tracker.report.bindAddress   127.0.0.1      
db.signature.text_profile.min_token_len  2      
query.phrase.boost       1.0    
lang.ngram.min.length    1      
dfs.df.interval  60000  
dfs.data.dir     /nutch/filesystem/data 
dfs.datanode.bindAddress         0.0.0.0        
fs.s3.maxRetries         4      
dfs.datanode.dns.interface       default        
http.agent.email         Jeff   
extension.clustering.hits-to-cluster     100    
searcher.max.time.tick_length    200    
http.agent.description   Jeff's Crawler 
query.lang.boost         0.0    
mapred.local.dir         /nutch/filesystem/mapreduce/local      
fs.hftp.impl     org.apache.hadoop.dfs.HftpFileSystem   
mapred.mapper.class      org.apache.nutch.parse.ParseSegment    
fs.trash.interval        0      
fs.s3.sleepTimeSeconds   10     
dfs.replication.min      1      
mapred.submit.replication        10     
indexer.max.title.length         100    
parser.character.encoding.default        windows-1252   
mapred.map.output.compression.codec      org.apache.hadoop.io.compress.DefaultCodec
mapred.tasktracker.dns.interface         default        
http.robots.agents       Jeffcrawler,*  
mapred.job.tracker       cisserver:9001 
dfs.heartbeat.interval   3      
urlfilter.regex.file     crawl-urlfilter.txt    
io.seqfile.sorter.recordlimit    1000000        
fetcher.store.content    true   
urlfilter.suffix.file    suffix-urlfilter.txt   
dfs.name.dir     /nutch/filesystem/name 
fetcher.verbose  true   
db.signature.class       org.apache.nutch.crawl.MD5Signature    
db.max.anchor.length     100    
parse.plugin.file        parse-plugins.xml      
nutch.segment.name       20071126123442 
mapred.local.dir.minspacekill    0      
searcher.dir     /var/nutch/crawl       
fs.kfs.impl      org.apache.hadoop.fs.kfs.KosmosFileSystem      
plugin.includes  protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfilter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|swf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-opic|subcollection
mapred.map.output.compression.type       RECORD 
mapred.temp.dir  ${hadoop.tmp.dir}/mapred/temp  
db.fetch.retry.max       3      
query.cc.boost   0.0    
dfs.replication  2      
db.ignore.internal.links         false  
dfs.info.bindAddress     0.0.0.0        
query.site.boost         0.0    
searcher.hostgrouping.rawhits.factor     2.0    
fetcher.server.min.delay         0.0    
hadoop.logfile.count     10     
indexer.termIndexInterval        128    
file.content.ignored     true   
db.score.link.internal   1.0    
io.seqfile.compress.blocksize    1000000        
fs.s3.block.size         67108864       
ftp.server.timeout       100000 
http.max.delays  1000   
indexer.minMergeDocs     50     
mapred.reduce.parallel.copies    5      
io.seqfile.lazydecompress        true   
mapred.output.dir        /user/nutch/crawl20071126/segments/20071126123442
indexer.max.tokens       10000000       
io.sort.mb       100    
ipc.client.connection.maxidletime        1000   
db.fetch.schedule.adaptive.max_interval  31536000.0     
mapred.compress.map.output       false  
ipc.client.kill.max      10     
urlnormalizer.order      org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
ipc.client.connect.max.retries   10     
urlfilter.prefix.file    prefix-urlfilter.txt   
db.signature.text_profile.quant_rate     0.01   
query.type.boost         0.0    
fs.s3.impl       org.apache.hadoop.fs.s3.S3FileSystem   
mime.type.magic  true   
generate.max.per.host    -1     
db.fetch.interval.max    7776000        
urlnormalizer.loop.count         1      
mapred.input.dir /user/nutch/crawl20071126/segments/20071126123442/content
io.file.buffer.size      4096   
db.score.injected        1.0    
dfs.replication.considerLoad     true   
jobclient.output.filter  FAILED 
mapred.tasktracker.tasks.maximum         2      
io.compression.codecs    org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec
fs.checkpoint.size       67108864       


________________________________

        From: Bolle, Jeffrey F. [mailto:[EMAIL PROTECTED]]
        Sent: Monday, November 26, 2007 3:08 PM
        To: [email protected]
        Subject: Crash in Parser
        
        
        All,
        I'm having some trouble with the Nutch nightly.  It has been a while since I last updated my crawl of our intranet.  I was attempting to run the crawl today and it failed with this:
        Exception in thread "main" java.io.IOException: Job failed!
                at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
                at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
                at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
        
        In the web interface it says:
        Task task_200711261211_0026_m_000015_0 failed to report status for 602 seconds. Killing!
        Task task_200711261211_0026_m_000015_1 failed to report status for 601 seconds. Killing!
        Task task_200711261211_0026_m_000015_2 failed to report status for 601 seconds. Killing!
        Task task_200711261211_0026_m_000015_3 failed to report status for 602 seconds. Killing!
        I don't have the fetchers set to parse.  Nutch and Hadoop are running on a 3-node cluster.  I've attached the job configuration file as saved from the web interface.
        Is there any way I can get more information on which file or URL the parse is failing on?  Why doesn't the parsing of a file or URL fail more cleanly?
        Any recommendations on helping Nutch avoid whatever is causing the hang and allowing it to index the rest of the content?
        Thanks,
        Jeff Bolle

