Ned,
Thanks for the hint; I found the advice about using kill -s SIGQUIT in
an earlier post.  Luckily, I spotted the hung thread on the machine and
managed to get the command in before Nutch killed the task.

It doesn't appear that I am stuck in the regexp.  I ran the command a
few times; here are the last two iterations:

2007-11-27 17:46:29
Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):

"Comm thread for task_200711270828_0031_m_000016_1" daemon prio=10
tid=0x52229c00 nid=0x33ce waiting on condition [0x5209a000..0x5209afb0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.Task$1.run(Task.java:281)
        at java.lang.Thread.run(Thread.java:619)

"[EMAIL PROTECTED]" daemon prio=10
tid=0x52231800 nid=0x33cd waiting on condition [0x520ec000..0x520ec130]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
        at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to cisserver/192.168.100.215:9000" daemon
prio=10 tid=0x52203800 nid=0x33cc in Object.wait()
[0x5213c000..0x5213d0b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"IPC Client connection to /127.0.0.1:51728" daemon prio=10
tid=0x52238c00 nid=0x33cb in Object.wait() [0x521a0000..0x521a0e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10
tid=0x52235000 nid=0x33ca waiting on condition [0x521f1000..0x521f1db0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)

"Low Memory Detector" daemon prio=10 tid=0x08ad9400 nid=0x33c7 runnable
[0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x08ad7800 nid=0x33c6 waiting on
condition [0x00000000..0x525d2688]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x08ad6400 nid=0x33c5 waiting on
condition [0x00000000..0x526535c8]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x08ad5000 nid=0x33c4 runnable
[0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x08ac2000 nid=0x33c3 in Object.wait()
[0x528f5000..0x528f60b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x08ac1400 nid=0x33c2 in
Object.wait() [0x52946000..0x52946e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727dde8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x5727dde8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x089fd000 nid=0x33be runnable
[0xb7fab000..0xb7fac208]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuffer.append(StringBuffer.java:224)
        - locked <0xab27c430> (a java.lang.StringBuffer)
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.apache.nutch.parse.html.DOMBuilder.characters(DOMBuilder.java:405)
        at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:461)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:451)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:210)
        at org.apache.nutch.parse.html.HtmlParser.parseTagSoup(HtmlParser.java:222)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:209)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

"VM Thread" prio=10 tid=0x08abe800 nid=0x33c1 runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x08a03c00 nid=0x33bf
runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x08a04c00 nid=0x33c0
runnable 

"VM Periodic Task Thread" prio=10 tid=0x08adac00 nid=0x33c8 waiting on
condition 

JNI global references: 1196

Heap
 PSYoungGen      total 159424K, used 155715K [0xaa7a0000, 0xb4e40000,
0xb4e40000)
  eden space 148224K, 97% used [0xaa7a0000,0xb34ccfa0,0xb3860000)
  from space 11200K, 99% used [0xb3860000,0xb4343f40,0xb4350000)
  to   space 11200K, 0% used [0xb4350000,0xb4350000,0xb4e40000)
 PSOldGen        total 369088K, used 120964K [0x57240000, 0x6dab0000,
0xaa7a0000)
  object space 369088K, 32% used [0x57240000,0x5e861268,0x6dab0000)
 PSPermGen       total 16384K, used 8760K [0x53240000, 0x54240000,
0x57240000)
  object space 16384K, 53% used [0x53240000,0x53ace250,0x54240000)

2007-11-27 17:47:25
Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):

"Comm thread for task_200711270828_0031_m_000016_1" daemon prio=10
tid=0x52229c00 nid=0x33ce waiting on condition [0x5209a000..0x5209afb0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.Task$1.run(Task.java:281)
        at java.lang.Thread.run(Thread.java:619)

"[EMAIL PROTECTED]" daemon prio=10
tid=0x52231800 nid=0x33cd waiting on condition [0x520ec000..0x520ec130]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
        at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to cisserver/192.168.100.215:9000" daemon
prio=10 tid=0x52203800 nid=0x33cc in Object.wait()
[0x5213c000..0x5213d0b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"IPC Client connection to /127.0.0.1:51728" daemon prio=10
tid=0x52238c00 nid=0x33cb in Object.wait() [0x521a0000..0x521a0e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10
tid=0x52235000 nid=0x33ca waiting on condition [0x521f1000..0x521f1db0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)

"Low Memory Detector" daemon prio=10 tid=0x08ad9400 nid=0x33c7 runnable
[0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x08ad7800 nid=0x33c6 waiting on
condition [0x00000000..0x525d2688]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x08ad6400 nid=0x33c5 waiting on
condition [0x00000000..0x526535c8]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x08ad5000 nid=0x33c4 waiting on
condition [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x08ac2000 nid=0x33c3 in Object.wait()
[0x528f5000..0x528f60b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x08ac1400 nid=0x33c2 in
Object.wait() [0x52946000..0x52946e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727dde8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x5727dde8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x089fd000 nid=0x33be runnable
[0xb7fab000..0xb7fac208]
   java.lang.Thread.State: RUNNABLE
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.apache.nutch.parse.html.DOMBuilder.characters(DOMBuilder.java:405)
        at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:461)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:451)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:210)
        at org.apache.nutch.parse.html.HtmlParser.parseTagSoup(HtmlParser.java:222)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:209)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

"VM Thread" prio=10 tid=0x08abe800 nid=0x33c1 runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x08a03c00 nid=0x33bf
runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x08a04c00 nid=0x33c0
runnable 

"VM Periodic Task Thread" prio=10 tid=0x08adac00 nid=0x33c8 waiting on
condition 

JNI global references: 1196

Heap
 PSYoungGen      total 159104K, used 137234K [0xaa7a0000, 0xb4e40000,
0xb4e40000)
  eden space 147584K, 85% used [0xaa7a0000,0xb2272700,0xb37c0000)
  from space 11520K, 99% used [0xb4300000,0xb4e32480,0xb4e40000)
  to   space 11520K, 0% used [0xb37c0000,0xb37c0000,0xb4300000)
 PSOldGen        total 412672K, used 132811K [0x57240000, 0x70540000,
0xaa7a0000)
  object space 412672K, 32% used [0x57240000,0x5f3f2f78,0x70540000)
 PSPermGen       total 16384K, used 8760K [0x53240000, 0x54240000,
0x57240000)
  object space 16384K, 53% used [0x53240000,0x53ace250,0x54240000) 


For a long time it sat in java.util.Arrays.copyOf, but it does appear
to have eventually returned from that.  I think my problem may lie more
in making sure the task JVMs have enough memory and enough time to
parse larger documents (10MB).  Even so, it is frustrating that the
failure to parse one document kills the whole parse job.  Is there a
way to make this more granular at the document level, so that when a
single document hangs, times out, or throws an exception, the job
abandons just that document and returns whatever has been parsed so
far?
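
One idea (sketched below; untested, and the names are placeholders, not
real Nutch or Hadoop API) would be to run each document's parse on a
worker thread and give up after a per-document timeout.  The 601-602
second kills line up with the mapred.task.timeout of 600000 ms in the
job configuration, so raising that value would buy time too, but a
per-document guard seems cleaner:

    import java.util.concurrent.*;

    // Sketch only: run one document's parse on a worker thread so a
    // pathological page can be abandoned without killing the map task.
    public class GuardedParse {
        // A cached pool lets later parses proceed even if a stuck
        // thread never responds to the interrupt below.
        private static final ExecutorService POOL =
            Executors.newCachedThreadPool();

        public static String parseWithTimeout(final String content,
                                              long timeoutMs) {
            Future<String> result = POOL.submit(new Callable<String>() {
                public String call() throws Exception {
                    return parseDoc(content);
                }
            });
            try {
                return result.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // A thread stuck in regex backtracking ignores the
                // interrupt, so the worker may leak; the pool bounds it.
                result.cancel(true);
                return null;  // record one failed parse, keep the job alive
            } catch (Exception e) {
                return null;
            }
        }

        // Placeholder for the real per-document parser invocation
        // (e.g. HtmlParser.getParse); not actual Nutch API.
        private static String parseDoc(String content) {
            return content;
        }
    }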

Thanks.

Jeff


-----Original Message-----
From: Ned Rockson [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 27, 2007 2:25 PM
To: [email protected]
Subject: Re: Crash in Parser

This is a problem with regex parsing.  It has happened for me in the
urlnormalizer, where a URL was parsed incorrectly and for some reason
was extremely long or contained control characters.  What happens is
that if the URL is really long (say thousands of characters), the
matching falls into a very inefficient backtracking algorithm (I
believe O(n^3), but I am not sure) while looking for certain features.
I fixed this by having prefix-urlnormalizer first check that the length
of the URL is less than some constant (which I have defined as 1024).
I also saw this problem the other day with the .js parser.  There was a
page, http://www.magic-cadeaux.fr/, that had a javascript line of
150000 slashes in a row.  It parses fine in a browser, but again it led
to an endless regex loop.
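
The check itself is trivial.  Roughly, it looks like this (a sketch
with illustrative names, not the actual plugin code):

    // Guard the regex normalizer with a length check so pathological
    // URLs skip the pass that can backtrack for minutes.
    public class UrlLengthGuard {
        private static final int MAX_URL_LENGTH = 1024;

        // Stand-in for the regex-based normalizer.
        public interface Normalizer {
            String normalize(String url);
        }

        public static String normalizeSafely(String url, Normalizer regex) {
            if (url == null || url.length() > MAX_URL_LENGTH) {
                return url;  // too long: pass through (or drop) unnormalized
            }
            return regex.normalize(url);
        }
    }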

If you find these are your problems, you can locate the stuck task and
do a kill -SIGQUIT on it, which dumps stack traces to stdout
(redirected to logs/userlogs/[task name]/stdout).  Then check whether
it is stuck in a regex loop and what put it there.
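
If the TaskTracker kills the task before you can catch it, another
option is a small watchdog inside the task that dumps every thread's
stack with the standard Thread.getAllStackTraces() call; a minimal
sketch:

    import java.util.Map;

    // Print every thread's name, state, and stack to stdout, roughly
    // what SIGQUIT produces, so it lands in the same userlogs file.
    public class StackDumper {
        public static void dumpAllStacks() {
            Map<Thread, StackTraceElement[]> stacks =
                Thread.getAllStackTraces();
            for (Map.Entry<Thread, StackTraceElement[]> entry :
                     stacks.entrySet()) {
                System.out.println("\"" + entry.getKey().getName()
                    + "\" " + entry.getKey().getState());
                for (StackTraceElement frame : entry.getValue()) {
                    System.out.println("\tat " + frame);
                }
                System.out.println();
            }
        }
    }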

--Ned

Bolle, Jeffrey F. wrote:
> Apparently the job configuration file didn't make it through the
> listserv.  Here it is in the body of the e-mail.
>  
> Jeff
>  
>
> Job Configuration: JobId - job_200711261211_0026
>
>
> name   value  
> dfs.secondary.info.bindAddress         0.0.0.0        
> dfs.datanode.port      50010  
> dfs.client.buffer.dir  ${hadoop.tmp.dir}/dfs/tmp      
> searcher.summary.length        20     
> generate.update.crawldb        false  
> lang.ngram.max.length  4      
> tasktracker.http.port  50060  
> searcher.filter.cache.size     16     
> ftp.timeout    60000  
> hadoop.tmp.dir         /tmp/hadoop-${user.name}       
> hadoop.native.lib      true   
> map.sort.class         org.apache.hadoop.mapred.MergeSorter   
> ftp.follow.talk        false  
> indexer.mergeFactor    50     
> ipc.client.idlethreshold       4000   
> query.host.boost       2.0    
> mapred.system.dir      /nutch/filesystem/mapreduce/system     
> ftp.password   [EMAIL PROTECTED]      
> http.agent.version     Nutch-1.0-dev  
> query.tag.boost        1.0    
> dfs.namenode.logging.level     info   
> db.fetch.schedule.adaptive.sync_delta_rate     0.3    
> io.skip.checksum.errors        false  
> urlfilter.automaton.file       automaton-urlfilter.txt        
> fs.default.name        cisserver:9000 
> db.ignore.external.links       false  
> extension.ontology.urls               
> dfs.safemode.threshold.pct     0.999f 
> dfs.namenode.handler.count     10     
> plugin.folders         plugins        
> mapred.tasktracker.dns.nameserver      default        
> io.sort.factor         10     
> fetcher.threads.per.host.by.ip         false  
> parser.html.impl       neko   
> mapred.task.timeout    600000 
> mapred.max.tracker.failures    4      
> hadoop.rpc.socket.factory.class.default
> org.apache.hadoop.net.StandardSocketFactory   
> db.update.additions.allowed    true   
> fs.hdfs.impl   org.apache.hadoop.dfs.DistributedFileSystem    
> indexer.score.power    0.5    
> ipc.client.maxidletime         120000 
> db.fetch.schedule.class        org.apache.nutch.crawl.DefaultFetchSchedule
>
> mapred.output.key.class        org.apache.hadoop.io.Text      
> file.content.limit     10485760       
> http.agent.url         http://poisk/index.php/Category:Systems
>
> dfs.safemode.extension         30000  
> tasktracker.http.threads       40     
> db.fetch.schedule.adaptive.dec_rate    0.2    
> user.name      nutch  
> mapred.output.compress         false  
> io.bytes.per.checksum  512    
> fetcher.server.delay   0.2    
> searcher.summary.context       5      
> db.fetch.interval.default      2592000        
> searcher.max.time.tick_count   -1     
> parser.html.form.use_action    false  
> fs.trash.root  ${hadoop.tmp.dir}/Trash        
> mapred.reduce.max.attempts     4      
> fs.ramfs.impl  org.apache.hadoop.fs.InMemoryFileSystem        
> db.score.count.filtered        false  
> fetcher.max.crawl.delay        30     
> dfs.info.port  50070  
> indexer.maxMergeDocs   2147483647     
> mapred.jar
> /nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
>
> fs.s3.buffer.dir       ${hadoop.tmp.dir}/s3   
> dfs.block.size         67108864       
> http.robots.403.allow  true   
> ftp.content.limit      10485760       
> job.end.retry.attempts         0      
> fs.file.impl   org.apache.hadoop.fs.LocalFileSystem   
> query.title.boost      1.5    
> mapred.speculative.execution   true   
> mapred.local.dir.minspacestart         0      
> mapred.output.compression.type         RECORD 
> mime.types.file        tika-mimetypes.xml     
> generate.max.per.host.by.ip    false  
> fetcher.parse  false  
> db.default.fetch.interval      30     
> db.max.outlinks.per.page       -1     
> analysis.common.terms.file     common-terms.utf8      
> mapred.userlog.retain.hours    24     
> dfs.replication.max    512    
> http.redirect.max      5      
> local.cache.size       10737418240    
> mapred.min.split.size  0      
> mapred.map.tasks       18     
> fetcher.threads.fetch  10     
> mapred.child.java.opts         -Xmx1500m      
> mapred.output.value.class      org.apache.nutch.parse.ParseImpl
>
> http.timeout   10000  
> http.content.limit     10485760       
> dfs.secondary.info.port        50090  
> ipc.server.listen.queue.size   128    
> encodingdetector.charset.min.confidence        -1     
> mapred.inmem.merge.threshold   1000   
> job.end.retry.interval         30000  
> fs.checkpoint.dir      ${hadoop.tmp.dir}/dfs/namesecondary    
> query.url.boost        4.0    
> mapred.reduce.tasks    6      
> db.score.link.external         1.0    
> query.anchor.boost     2.0    
> mapred.userlog.limit.kb        0      
> webinterface.private.actions   false  
> db.max.inlinks         10000000       
> mapred.job.split.file
> /nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
>
> mapred.job.name        parse crawl20071126/segments/20071126123442
>
> dfs.datanode.dns.nameserver    default        
> dfs.blockreport.intervalMsec   3600000        
> ftp.username   anonymous      
> db.fetch.schedule.adaptive.inc_rate    0.4    
> searcher.max.hits      -1     
> mapred.map.max.attempts        4      
> urlnormalizer.regex.file       regex-normalize.xml    
> ftp.keep.connection    false  
> searcher.filter.cache.threshold        0.05   
> mapred.job.tracker.handler.count       10     
> dfs.client.block.write.retries         3      
> mapred.input.format.class
> org.apache.hadoop.mapred.SequenceFileInputFormat      
> http.verbose   true   
> fetcher.threads.per.host       8      
> mapred.tasktracker.expiry.interval     600000 
> mapred.job.tracker.info.bindAddress    0.0.0.0        
> ipc.client.timeout     60000  
> keep.failed.task.files         false  
> mapred.output.format.class
> org.apache.nutch.parse.ParseOutputFormat      
> mapred.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec    
> io.map.index.skip      0      
> mapred.working.dir     /user/nutch    
> tasktracker.http.bindAddress   0.0.0.0        
> io.seqfile.compression.type    RECORD 
> mapred.reducer.class   org.apache.nutch.parse.ParseSegment    
> lang.analyze.max.length        2048   
> db.fetch.schedule.adaptive.min_interval        60.0   
> http.agent.name        Jeffcrawler    
> dfs.default.chunk.view.size    32768  
> hadoop.logfile.size    10000000       
> dfs.datanode.du.pct    0.98f  
> parser.caching.forbidden.policy        content        
> http.useHttp11         false  
> fs.inmemory.size.mb    75     
> db.fetch.schedule.adaptive.sync_delta  true   
> dfs.datanode.du.reserved       0      
> mapred.job.tracker.info.port   50030  
> plugin.auto-activation         true   
> fs.checkpoint.period   3600   
> mapred.jobtracker.completeuserjobs.maximum     100    
> mapred.task.tracker.report.bindAddress         127.0.0.1      
> db.signature.text_profile.min_token_len        2      
> query.phrase.boost     1.0    
> lang.ngram.min.length  1      
> dfs.df.interval        60000  
> dfs.data.dir   /nutch/filesystem/data 
> dfs.datanode.bindAddress       0.0.0.0        
> fs.s3.maxRetries       4      
> dfs.datanode.dns.interface     default        
> http.agent.email       Jeff   
> extension.clustering.hits-to-cluster   100    
> searcher.max.time.tick_length  200    
> http.agent.description         Jeff's Crawler 
> query.lang.boost       0.0    
> mapred.local.dir       /nutch/filesystem/mapreduce/local      
> fs.hftp.impl   org.apache.hadoop.dfs.HftpFileSystem   
> mapred.mapper.class    org.apache.nutch.parse.ParseSegment    
> fs.trash.interval      0      
> fs.s3.sleepTimeSeconds         10     
> dfs.replication.min    1      
> mapred.submit.replication      10     
> indexer.max.title.length       100    
> parser.character.encoding.default      windows-1252   
> mapred.map.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec    
> mapred.tasktracker.dns.interface       default        
> http.robots.agents     Jeffcrawler,*  
> mapred.job.tracker     cisserver:9001 
> dfs.heartbeat.interval         3      
> urlfilter.regex.file   crawl-urlfilter.txt    
> io.seqfile.sorter.recordlimit  1000000        
> fetcher.store.content  true   
> urlfilter.suffix.file  suffix-urlfilter.txt   
> dfs.name.dir   /nutch/filesystem/name 
> fetcher.verbose        true   
> db.signature.class     org.apache.nutch.crawl.MD5Signature    
> db.max.anchor.length   100    
> parse.plugin.file      parse-plugins.xml      
> nutch.segment.name     20071126123442 
> mapred.local.dir.minspacekill  0      
> searcher.dir   /var/nutch/crawl       
> fs.kfs.impl    org.apache.hadoop.fs.kfs.KosmosFileSystem      
> plugin.includes
> protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|
> urlfilter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|swf|ext)|
> index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|
> query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-opic|subcollection
> mapred.map.output.compression.type     RECORD 
> mapred.temp.dir        ${hadoop.tmp.dir}/mapred/temp  
> db.fetch.retry.max     3      
> query.cc.boost         0.0    
> dfs.replication        2      
> db.ignore.internal.links       false  
> dfs.info.bindAddress   0.0.0.0        
> query.site.boost       0.0    
> searcher.hostgrouping.rawhits.factor   2.0    
> fetcher.server.min.delay       0.0    
> hadoop.logfile.count   10     
> indexer.termIndexInterval      128    
> file.content.ignored   true   
> db.score.link.internal         1.0    
> io.seqfile.compress.blocksize  1000000        
> fs.s3.block.size       67108864       
> ftp.server.timeout     100000 
> http.max.delays        1000   
> indexer.minMergeDocs   50     
> mapred.reduce.parallel.copies  5      
> io.seqfile.lazydecompress      true   
> mapred.output.dir
> /user/nutch/crawl20071126/segments/20071126123442     
> indexer.max.tokens     10000000       
> io.sort.mb     100    
> ipc.client.connection.maxidletime      1000   
> db.fetch.schedule.adaptive.max_interval        31536000.0     
> mapred.compress.map.output     false  
> ipc.client.kill.max    10     
> urlnormalizer.order
> org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer   
> ipc.client.connect.max.retries         10     
> urlfilter.prefix.file  prefix-urlfilter.txt   
> db.signature.text_profile.quant_rate   0.01   
> query.type.boost       0.0    
> fs.s3.impl     org.apache.hadoop.fs.s3.S3FileSystem   
> mime.type.magic        true   
> generate.max.per.host  -1     
> db.fetch.interval.max  7776000        
> urlnormalizer.loop.count       1      
> mapred.input.dir
> /user/nutch/crawl20071126/segments/20071126123442/content     
> io.file.buffer.size    4096   
> db.score.injected      1.0    
> dfs.replication.considerLoad   true   
> jobclient.output.filter        FAILED 
> mapred.tasktracker.tasks.maximum       2      
> io.compression.codecs
> org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec
> fs.checkpoint.size     67108864       
>
>
> ________________________________
>
>       From: Bolle, Jeffrey F. [mailto:[EMAIL PROTECTED] 
>       Sent: Monday, November 26, 2007 3:08 PM
>       To: [email protected]
>       Subject: Crash in Parser
>       
>       
>       All,
>       I'm having some trouble with the Nutch nightly.  It has been a
> while since I last updated my crawl of our intranet.  I was attempting
> to run the crawl today and it failed with this:
>       Exception in thread "main" java.io.IOException: Job failed!
>               at
> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>               at
> org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
>               at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
>       
>       In the web interface it says that:
>       Task task_200711261211_0026_m_000015_0 failed to report status
> for 602 seconds. Killing!
>       
>       Task task_200711261211_0026_m_000015_1 failed to report status
> for 601 seconds. Killing!
>       
>       Task task_200711261211_0026_m_000015_2 failed to report status
> for 601 seconds. Killing!
>       
>       Task task_200711261211_0026_m_000015_3 failed to report status
> for 602 seconds. Killing!
>        
>       I don't have the fetchers set to parse.  Nutch and hadoop are
> running on a 3 node cluster.  I've attached the job configuration file
> as saved from the web interface.
>        
>       Is there any way I can get more information on which file or
> url the parse is failing on?  Why doesn't the parsing of a file or URL
> fail more cleanly?
>        
>       Any recommendations on helping nutch avoid whatever is causing
> the hang and allowing it to index the rest of the content?
>        
>       Thanks.
>        
>        
>       Jeff Bolle
>        
>
>
>   
