Ned,
Thanks for the hint; I found the advice about using kill -s SIGQUIT in an
earlier post. Luckily, I saw the hung task on the machine and managed to
get the command in before Nutch killed it.
It doesn't appear that I am stuck in the regex. I ran the command a few
times; here are the last two dumps:
2007-11-27 17:46:29
Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):

"Comm thread for task_200711270828_0031_m_000016_1" daemon prio=10 tid=0x52229c00 nid=0x33ce waiting on condition [0x5209a000..0x5209afb0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.Task$1.run(Task.java:281)
        at java.lang.Thread.run(Thread.java:619)

"[EMAIL PROTECTED]" daemon prio=10 tid=0x52231800 nid=0x33cd waiting on condition [0x520ec000..0x520ec130]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
        at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to cisserver/192.168.100.215:9000" daemon prio=10 tid=0x52203800 nid=0x33cc in Object.wait() [0x5213c000..0x5213d0b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"IPC Client connection to /127.0.0.1:51728" daemon prio=10 tid=0x52238c00 nid=0x33cb in Object.wait() [0x521a0000..0x521a0e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 tid=0x52235000 nid=0x33ca waiting on condition [0x521f1000..0x521f1db0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)

"Low Memory Detector" daemon prio=10 tid=0x08ad9400 nid=0x33c7 runnable [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x08ad7800 nid=0x33c6 waiting on condition [0x00000000..0x525d2688]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x08ad6400 nid=0x33c5 waiting on condition [0x00000000..0x526535c8]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x08ad5000 nid=0x33c4 runnable [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x08ac2000 nid=0x33c3 in Object.wait() [0x528f5000..0x528f60b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x08ac1400 nid=0x33c2 in Object.wait() [0x52946000..0x52946e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727dde8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x5727dde8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x089fd000 nid=0x33be runnable [0xb7fab000..0xb7fac208]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuffer.append(StringBuffer.java:224)
        - locked <0xab27c430> (a java.lang.StringBuffer)
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.apache.nutch.parse.html.DOMBuilder.characters(DOMBuilder.java:405)
        at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:461)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:451)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:210)
        at org.apache.nutch.parse.html.HtmlParser.parseTagSoup(HtmlParser.java:222)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:209)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

"VM Thread" prio=10 tid=0x08abe800 nid=0x33c1 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x08a03c00 nid=0x33bf runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x08a04c00 nid=0x33c0 runnable

"VM Periodic Task Thread" prio=10 tid=0x08adac00 nid=0x33c8 waiting on condition

JNI global references: 1196

Heap
 PSYoungGen      total 159424K, used 155715K [0xaa7a0000, 0xb4e40000, 0xb4e40000)
  eden space 148224K, 97% used [0xaa7a0000,0xb34ccfa0,0xb3860000)
  from space 11200K, 99% used [0xb3860000,0xb4343f40,0xb4350000)
  to   space 11200K, 0% used [0xb4350000,0xb4350000,0xb4e40000)
 PSOldGen        total 369088K, used 120964K [0x57240000, 0x6dab0000, 0xaa7a0000)
  object space 369088K, 32% used [0x57240000,0x5e861268,0x6dab0000)
 PSPermGen       total 16384K, used 8760K [0x53240000, 0x54240000, 0x57240000)
  object space 16384K, 53% used [0x53240000,0x53ace250,0x54240000)

2007-11-27 17:47:25
Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):

"Comm thread for task_200711270828_0031_m_000016_1" daemon prio=10 tid=0x52229c00 nid=0x33ce waiting on condition [0x5209a000..0x5209afb0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.Task$1.run(Task.java:281)
        at java.lang.Thread.run(Thread.java:619)

"[EMAIL PROTECTED]" daemon prio=10 tid=0x52231800 nid=0x33cd waiting on condition [0x520ec000..0x520ec130]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
        at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to cisserver/192.168.100.215:9000" daemon prio=10 tid=0x52203800 nid=0x33cc in Object.wait() [0x5213c000..0x5213d0b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"IPC Client connection to /127.0.0.1:51728" daemon prio=10 tid=0x52238c00 nid=0x33cb in Object.wait() [0x521a0000..0x521a0e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 tid=0x52235000 nid=0x33ca waiting on condition [0x521f1000..0x521f1db0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)

"Low Memory Detector" daemon prio=10 tid=0x08ad9400 nid=0x33c7 runnable [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x08ad7800 nid=0x33c6 waiting on condition [0x00000000..0x525d2688]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x08ad6400 nid=0x33c5 waiting on condition [0x00000000..0x526535c8]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x08ad5000 nid=0x33c4 waiting on condition [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x08ac2000 nid=0x33c3 in Object.wait() [0x528f5000..0x528f60b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x08ac1400 nid=0x33c2 in Object.wait() [0x52946000..0x52946e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727dde8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x5727dde8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x089fd000 nid=0x33be runnable [0xb7fab000..0xb7fac208]
   java.lang.Thread.State: RUNNABLE
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.apache.nutch.parse.html.DOMBuilder.characters(DOMBuilder.java:405)
        at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:461)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:451)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:210)
        at org.apache.nutch.parse.html.HtmlParser.parseTagSoup(HtmlParser.java:222)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:209)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

"VM Thread" prio=10 tid=0x08abe800 nid=0x33c1 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x08a03c00 nid=0x33bf runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x08a04c00 nid=0x33c0 runnable

"VM Periodic Task Thread" prio=10 tid=0x08adac00 nid=0x33c8 waiting on condition

JNI global references: 1196

Heap
 PSYoungGen      total 159104K, used 137234K [0xaa7a0000, 0xb4e40000, 0xb4e40000)
  eden space 147584K, 85% used [0xaa7a0000,0xb2272700,0xb37c0000)
  from space 11520K, 99% used [0xb4300000,0xb4e32480,0xb4e40000)
  to   space 11520K, 0% used [0xb37c0000,0xb37c0000,0xb4300000)
 PSOldGen        total 412672K, used 132811K [0x57240000, 0x70540000, 0xaa7a0000)
  object space 412672K, 32% used [0x57240000,0x5f3f2f78,0x70540000)
 PSPermGen       total 16384K, used 8760K [0x53240000, 0x54240000, 0x57240000)
  object space 16384K, 53% used [0x53240000,0x53ace250,0x54240000)
For a long time it sat in java.util.Arrays.copyOf, but it does appear to
have eventually returned from that. I think my problem lies more in making
sure the task JVMs have enough memory and enough time to parse larger
documents (10MB). Even so, it is frustrating that a failure to parse one
document kills the whole parse job. Is there a way to make this more
granular at the document level, and even, as content is being accumulated,
to return whatever has been parsed already before the job hangs, times out,
or throws an exception?
Thanks.
Jeff
-----Original Message-----
From: Ned Rockson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 27, 2007 2:25 PM
To: [email protected]
Subject: Re: Crash in Parser
This is a problem with regex parsing. It has happened for me in the
urlnormalizer, where a URL was parsed incorrectly and for some reason was
extremely long or contained control characters. What happens is that if
the URL is really long (say, thousands of characters), the regex engine
falls into a very inefficient algorithm (I believe O(n^3), but I'm not
sure) to find certain features. I fixed this by having prefix-urlnormalizer
first check whether the length of the URL is less than some constant (I
have it defined as 1024). I also saw this problem the other day with the
.js parser. There was a page, http://www.magic-cadeaux.fr/, with a
JavaScript line that was 150,000 slashes in a row. It parses fine in a
browser, but again this led to an endless regex loop.
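For what it's worth, the guard is only a few lines. Here is a sketch of
the idea; the class name, the MAX_URL_LENGTH constant, and the 1024 cutoff
are my choices for illustration, not anything shipped with Nutch.

    public class LengthGuardedNormalizer {
        // Cutoff I chose for my setup; not a Nutch default.
        private static final int MAX_URL_LENGTH = 1024;

        // Reject oversized URLs before the regex normalization ever sees
        // them, so the pathological backtracking never gets a chance to run.
        public String normalize(String urlString) {
            if (urlString == null || urlString.length() > MAX_URL_LENGTH) {
                return null; // treat as invalid instead of normalizing
            }
            // ... the usual regex-based normalization would run here ...
            return urlString;
        }
    }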
If you find these are your problems too, you can locate the stuck task and
run kill -SIGQUIT on its process, which dumps stack traces to stdout
(redirected to logs/userlogs/[task name]/stdout). Then check whether it is
stuck in a regex loop and what put it there.
--Ned
Bolle, Jeffrey F. wrote:
> Apparently the job configuration file didn't make it through the
> listserv. Here it is in the body of the e-mail.
>
> Jeff
>
>
> Job Configuration: JobId - job_200711261211_0026
>
>
> name value
> dfs.secondary.info.bindAddress 0.0.0.0
> dfs.datanode.port 50010
> dfs.client.buffer.dir ${hadoop.tmp.dir}/dfs/tmp
> searcher.summary.length 20
> generate.update.crawldb false
> lang.ngram.max.length 4
> tasktracker.http.port 50060
> searcher.filter.cache.size 16
> ftp.timeout 60000
> hadoop.tmp.dir /tmp/hadoop-${user.name}
> hadoop.native.lib true
> map.sort.class org.apache.hadoop.mapred.MergeSorter
> ftp.follow.talk false
> indexer.mergeFactor 50
> ipc.client.idlethreshold 4000
> query.host.boost 2.0
> mapred.system.dir /nutch/filesystem/mapreduce/system
> ftp.password [EMAIL PROTECTED]
> http.agent.version Nutch-1.0-dev
> query.tag.boost 1.0
> dfs.namenode.logging.level info
> db.fetch.schedule.adaptive.sync_delta_rate 0.3
> io.skip.checksum.errors false
> urlfilter.automaton.file automaton-urlfilter.txt
> fs.default.name cisserver:9000
> db.ignore.external.links false
> extension.ontology.urls
> dfs.safemode.threshold.pct 0.999f
> dfs.namenode.handler.count 10
> plugin.folders plugins
> mapred.tasktracker.dns.nameserver default
> io.sort.factor 10
> fetcher.threads.per.host.by.ip false
> parser.html.impl neko
> mapred.task.timeout 600000
> mapred.max.tracker.failures 4
> hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFactory
> db.update.additions.allowed true
> fs.hdfs.impl org.apache.hadoop.dfs.DistributedFileSystem
> indexer.score.power 0.5
> ipc.client.maxidletime 120000
> db.fetch.schedule.class org.apache.nutch.crawl.DefaultFetchSchedule
>
> mapred.output.key.class org.apache.hadoop.io.Text
> file.content.limit 10485760
> http.agent.url http://poisk/index.php/Category:Systems
> dfs.safemode.extension 30000
> tasktracker.http.threads 40
> db.fetch.schedule.adaptive.dec_rate 0.2
> user.name nutch
> mapred.output.compress false
> io.bytes.per.checksum 512
> fetcher.server.delay 0.2
> searcher.summary.context 5
> db.fetch.interval.default 2592000
> searcher.max.time.tick_count -1
> parser.html.form.use_action false
> fs.trash.root ${hadoop.tmp.dir}/Trash
> mapred.reduce.max.attempts 4
> fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem
> db.score.count.filtered false
> fetcher.max.crawl.delay 30
> dfs.info.port 50070
> indexer.maxMergeDocs 2147483647
> mapred.jar /nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
>
> fs.s3.buffer.dir ${hadoop.tmp.dir}/s3
> dfs.block.size 67108864
> http.robots.403.allow true
> ftp.content.limit 10485760
> job.end.retry.attempts 0
> fs.file.impl org.apache.hadoop.fs.LocalFileSystem
> query.title.boost 1.5
> mapred.speculative.execution true
> mapred.local.dir.minspacestart 0
> mapred.output.compression.type RECORD
> mime.types.file tika-mimetypes.xml
> generate.max.per.host.by.ip false
> fetcher.parse false
> db.default.fetch.interval 30
> db.max.outlinks.per.page -1
> analysis.common.terms.file common-terms.utf8
> mapred.userlog.retain.hours 24
> dfs.replication.max 512
> http.redirect.max 5
> local.cache.size 10737418240
> mapred.min.split.size 0
> mapred.map.tasks 18
> fetcher.threads.fetch 10
> mapred.child.java.opts -Xmx1500m
> mapred.output.value.class org.apache.nutch.parse.ParseImpl
>
> http.timeout 10000
> http.content.limit 10485760
> dfs.secondary.info.port 50090
> ipc.server.listen.queue.size 128
> encodingdetector.charset.min.confidence -1
> mapred.inmem.merge.threshold 1000
> job.end.retry.interval 30000
> fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary
> query.url.boost 4.0
> mapred.reduce.tasks 6
> db.score.link.external 1.0
> query.anchor.boost 2.0
> mapred.userlog.limit.kb 0
> webinterface.private.actions false
> db.max.inlinks 10000000
> mapred.job.split.file /nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
>
> mapred.job.name parse crawl20071126/segments/20071126123442
> dfs.datanode.dns.nameserver default
> dfs.blockreport.intervalMsec 3600000
> ftp.username anonymous
> db.fetch.schedule.adaptive.inc_rate 0.4
> searcher.max.hits -1
> mapred.map.max.attempts 4
> urlnormalizer.regex.file regex-normalize.xml
> ftp.keep.connection false
> searcher.filter.cache.threshold 0.05
> mapred.job.tracker.handler.count 10
> dfs.client.block.write.retries 3
> mapred.input.format.class org.apache.hadoop.mapred.SequenceFileInputFormat
> http.verbose true
> fetcher.threads.per.host 8
> mapred.tasktracker.expiry.interval 600000
> mapred.job.tracker.info.bindAddress 0.0.0.0
> ipc.client.timeout 60000
> keep.failed.task.files false
> mapred.output.format.class org.apache.nutch.parse.ParseOutputFormat
> mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
> io.map.index.skip 0
> mapred.working.dir /user/nutch
> tasktracker.http.bindAddress 0.0.0.0
> io.seqfile.compression.type RECORD
> mapred.reducer.class org.apache.nutch.parse.ParseSegment
> lang.analyze.max.length 2048
> db.fetch.schedule.adaptive.min_interval 60.0
> http.agent.name Jeffcrawler
> dfs.default.chunk.view.size 32768
> hadoop.logfile.size 10000000
> dfs.datanode.du.pct 0.98f
> parser.caching.forbidden.policy content
> http.useHttp11 false
> fs.inmemory.size.mb 75
> db.fetch.schedule.adaptive.sync_delta true
> dfs.datanode.du.reserved 0
> mapred.job.tracker.info.port 50030
> plugin.auto-activation true
> fs.checkpoint.period 3600
> mapred.jobtracker.completeuserjobs.maximum 100
> mapred.task.tracker.report.bindAddress 127.0.0.1
> db.signature.text_profile.min_token_len 2
> query.phrase.boost 1.0
> lang.ngram.min.length 1
> dfs.df.interval 60000
> dfs.data.dir /nutch/filesystem/data
> dfs.datanode.bindAddress 0.0.0.0
> fs.s3.maxRetries 4
> dfs.datanode.dns.interface default
> http.agent.email Jeff
> extension.clustering.hits-to-cluster 100
> searcher.max.time.tick_length 200
> http.agent.description Jeff's Crawler
> query.lang.boost 0.0
> mapred.local.dir /nutch/filesystem/mapreduce/local
> fs.hftp.impl org.apache.hadoop.dfs.HftpFileSystem
> mapred.mapper.class org.apache.nutch.parse.ParseSegment
> fs.trash.interval 0
> fs.s3.sleepTimeSeconds 10
> dfs.replication.min 1
> mapred.submit.replication 10
> indexer.max.title.length 100
> parser.character.encoding.default windows-1252
> mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec
> mapred.tasktracker.dns.interface default
> http.robots.agents Jeffcrawler,*
> mapred.job.tracker cisserver:9001
> dfs.heartbeat.interval 3
> urlfilter.regex.file crawl-urlfilter.txt
> io.seqfile.sorter.recordlimit 1000000
> fetcher.store.content true
> urlfilter.suffix.file suffix-urlfilter.txt
> dfs.name.dir /nutch/filesystem/name
> fetcher.verbose true
> db.signature.class org.apache.nutch.crawl.MD5Signature
> db.max.anchor.length 100
> parse.plugin.file parse-plugins.xml
> nutch.segment.name 20071126123442
> mapred.local.dir.minspacekill 0
> searcher.dir /var/nutch/crawl
> fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem
> plugin.includes protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfilter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|swf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-opic|subcollection
> mapred.map.output.compression.type RECORD
> mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp
> db.fetch.retry.max 3
> query.cc.boost 0.0
> dfs.replication 2
> db.ignore.internal.links false
> dfs.info.bindAddress 0.0.0.0
> query.site.boost 0.0
> searcher.hostgrouping.rawhits.factor 2.0
> fetcher.server.min.delay 0.0
> hadoop.logfile.count 10
> indexer.termIndexInterval 128
> file.content.ignored true
> db.score.link.internal 1.0
> io.seqfile.compress.blocksize 1000000
> fs.s3.block.size 67108864
> ftp.server.timeout 100000
> http.max.delays 1000
> indexer.minMergeDocs 50
> mapred.reduce.parallel.copies 5
> io.seqfile.lazydecompress true
> mapred.output.dir /user/nutch/crawl20071126/segments/20071126123442
> indexer.max.tokens 10000000
> io.sort.mb 100
> ipc.client.connection.maxidletime 1000
> db.fetch.schedule.adaptive.max_interval 31536000.0
> mapred.compress.map.output false
> ipc.client.kill.max 10
> urlnormalizer.order org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
> ipc.client.connect.max.retries 10
> urlfilter.prefix.file prefix-urlfilter.txt
> db.signature.text_profile.quant_rate 0.01
> query.type.boost 0.0
> fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
> mime.type.magic true
> generate.max.per.host -1
> db.fetch.interval.max 7776000
> urlnormalizer.loop.count 1
> mapred.input.dir /user/nutch/crawl20071126/segments/20071126123442/content
> io.file.buffer.size 4096
> db.score.injected 1.0
> dfs.replication.considerLoad true
> jobclient.output.filter FAILED
> mapred.tasktracker.tasks.maximum 2
> io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec
> fs.checkpoint.size 67108864
>
>
> ________________________________
>
> From: Bolle, Jeffrey F. [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 26, 2007 3:08 PM
> To: [email protected]
> Subject: Crash in Parser
>
>
> All,
> I'm having some trouble with the Nutch nightly. It has been a while
> since I last updated my crawl of our intranet. I was attempting to run
> the crawl today, and it failed with this:
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
>
> In the web interface it says:
> Task task_200711261211_0026_m_000015_0 failed to report status for 602 seconds. Killing!
>
> Task task_200711261211_0026_m_000015_1 failed to report status for 601 seconds. Killing!
>
> Task task_200711261211_0026_m_000015_2 failed to report status for 601 seconds. Killing!
>
> Task task_200711261211_0026_m_000015_3 failed to report status for 602 seconds. Killing!
>
> I don't have the fetchers set to parse. Nutch and Hadoop are running
> on a three-node cluster. I've attached the job configuration file as
> saved from the web interface.
>
> Is there any way I can get more information on which file or URL the
> parse is failing on? Why doesn't the parsing of a single file or URL
> fail more cleanly?
>
> Any recommendations for helping Nutch avoid whatever is causing the
> hang, so that it can index the rest of the content?
>
> Thanks.
>
>
> Jeff Bolle