[ http://issues.apache.org/jira/browse/NUTCH-424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461647 ]
Karsten Dello commented on NUTCH-424:
-------------------------------------

Sorry for cloning, but I could not reopen the original issue.

The problem persists with the current stable Nutch version (0.8.1), which uses NekoHTML 0.9.4.

I do parsing after fetching ("nutch parse"), so I cannot see the URL in the log file, even though I have set the log level to DEBUG. Is there a way to accomplish this?

Anyway, it seems to be exactly the same error. Here is the output from jstack and "kill -SIGQUIT <pid>".

(1) jstack output

Attaching to process ID 27428, please wait...
Debugger attached successfully.
Client compiler detected.
JVM version is 1.5.0_05-b05

Thread 27439: (state = BLOCKED)
 - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=28, line=99 (Compiled frame)
 - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=36, line=393 (Compiled frame)
 - java.lang.StringBuffer.append(java.lang.String) @bci=2, line=225 (Compiled frame)
 - org.apache.xerces.dom.CharacterDataImpl.appendData(java.lang.String) @bci=59 (Compiled frame)
 - org.cyberneko.html.parsers.DOMFragmentParser.characters(org.apache.xerces.xni.XMLString, org.apache.xerces.xni.Augmentations) @bci=117, line=463 (Compiled frame)
 - org.cyberneko.html.filters.DefaultFilter.characters(org.apache.xerces.xni.XMLString, org.apache.xerces.xni.Augmentations) @bci=13, line=195 (Compiled frame)
 - org.cyberneko.html.HTMLTagBalancer.characters(org.apache.xerces.xni.XMLString, org.apache.xerces.xni.Augmentations) @bci=294, line=821 (Compiled frame)
 - org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters() @bci=324, line=1972 (Compiled frame)
 - org.cyberneko.html.HTMLScanner$ContentScanner.scan(boolean) @bci=184, line=1775 (Compiled frame)
 - org.cyberneko.html.HTMLScanner.scanDocument(boolean) @bci=5, line=789 (Compiled frame)
 - org.cyberneko.html.HTMLConfiguration.parse(org.apache.xerces.xni.parser.XMLInputSource) @bci=7, line=431 (Compiled frame)
 - org.cyberneko.html.parsers.DOMFragmentParser.parse(org.xml.sax.InputSource, org.w3c.dom.DocumentFragment) @bci=93, line=164 (Compiled frame)
 - org.apache.nutch.parse.html.HtmlParser.parseNeko(org.xml.sax.InputSource) @bci=76, line=261 (Compiled frame)
 - org.apache.nutch.parse.html.HtmlParser.parse(org.xml.sax.InputSource) @bci=20, line=225 (Compiled frame)
 - org.apache.nutch.parse.ParseUtil.parse(org.apache.nutch.protocol.Content) @bci=145, line=82 (Compiled frame)
 - org.apache.nutch.parse.ParseSegment.map(org.apache.hadoop.io.WritableComparable, org.apache.hadoop.io.Writable, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter) @bci=22, line=66 (Compiled frame)
 - org.apache.hadoop.mapred.MapRunner.run(org.apache.hadoop.mapred.RecordReader, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter) @bci=55, line=48 (Compiled frame)
 - org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=198, line=129 (Interpreted frame)
 - org.apache.hadoop.mapred.LocalJobRunner$Job.run() @bci=120, line=91 (Interpreted frame)

Thread 27435: (state = BLOCKED)

Thread 27434: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=116 (Compiled frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=132 (Compiled frame)

Thread 27433: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=474 (Compiled frame)

Thread 27428: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
 - org.apache.hadoop.mapred.JobClient.runJob(org.apache.hadoop.mapred.JobConf) @bci=67, line=332 (Interpreted frame)
 - org.apache.nutch.parse.ParseSegment.parse(org.apache.hadoop.fs.Path) @bci=303, line=120 (Interpreted frame)
 - org.apache.nutch.parse.ParseSegment.main(java.lang.String[]) @bci=43, line=138 (Interpreted frame)

(2) kill -SIGQUIT

Full thread dump Java HotSpot(TM) Client VM (1.5.0_05-b05 mixed mode):

"Thread-0" prio=1 tid=0x08518da0 nid=0x6b2f waiting on condition [0xababa000..0xababb680]
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
        at java.lang.StringBuffer.append(StringBuffer.java:225)
        - locked <0x45a086c8> (a java.lang.StringBuffer)
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)
        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
        at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:1972)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1775)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:261)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:225)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:164)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:66)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:129)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:91)

"Low Memory Detector" daemon prio=1 tid=0x080c6d20 nid=0x6b2d runnable [0x00000000..0x00000000]

"CompilerThread0" daemon prio=1 tid=0x080c57d0 nid=0x6b2c waiting on condition [0x00000000..0xa99d41e8]

"Signal Dispatcher" daemon prio=1 tid=0x080c4938 nid=0x6b2b waiting on condition [0x00000000..0x00000000]

"Finalizer" daemon prio=1 tid=0x080b9528 nid=0x6b2a in Object.wait() [0xa9891000..0xa9891500]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4cc933a8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x4cc933a8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=1 tid=0x080b8860 nid=0x6b29 in Object.wait() [0xa9810000..0xa9810580]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4cc93428> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:474)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x4cc93428> (a java.lang.ref.Reference$Lock)

"main" prio=1 tid=0x0805cba0 nid=0x6b24 waiting on condition [0xbfffc000..0xbfffcb58]
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:332)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:120)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:138)

"VM Thread" prio=1 tid=0x080b5c40 nid=0x6b28 runnable

"VM Periodic Task Thread" prio=1 tid=0x080c81b0 nid=0x6b2e waiting on condition

> CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-424
>                 URL: http://issues.apache.org/jira/browse/NUTCH-424
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>         Environment: Linux and Windows
>            Reporter: Karsten Dello
>
> I've tracked down occasional fetcher hangs to NekoHTML's DOMFragmentParser
> hanging on certain HTML documents, for example,
> http://www.inlandrevenue.gov.uk/charities/chapter_3.htm.
> The thread dump on the hung parser is:
>
> "CompilerThread0" daemon prio=1 tid=0x080c4c18 nid=0x47da waiting on condition [0x00000000..0x8a3daf68]
>
> "Signal Dispatcher" daemon prio=1 tid=0x080c3d60 nid=0x47d9 waiting on condition [0x00000000..0x00000000]
>
> "Finalizer" daemon prio=1 tid=0x080b8818 nid=0x47d8 in Object.wait() [0x8a2a0000..0x8a2a0680]
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x4a60d058> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>         - locked <0x4a60d058> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=1 tid=0x080b7b50 nid=0x47d7 in Object.wait() [0x8a21f000..0x8a21f800]
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x4a60d0d8> (a java.lang.ref.Reference$Lock)
>         at java.lang.Object.wait(Object.java:474)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>         - locked <0x4a60d0d8> (a java.lang.ref.Reference$Lock)
>
> "main" prio=1 tid=0x0805c170 nid=0x47d1 waiting on condition [0xbfffc000..0xbfffcec8]
>         at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
>         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
>         at java.lang.StringBuffer.append(StringBuffer.java:225)
>         - locked <0x45910118> (a java.lang.StringBuffer)
>         at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
>         at org.cyberneko.html.parsers.DOMFragmentParser.characters(Unknown Source)
>         at org.cyberneko.html.filters.DefaultFilter.characters(Unknown Source)
>         at org.cyberneko.html.HTMLTagBalancer.characters(Unknown Source)
>         at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(Unknown Source)
>         at org.cyberneko.html.HTMLScanner$ContentScanner.scan(Unknown Source)
>         at org.cyberneko.html.HTMLScanner.scanDocument(Unknown Source)
>         at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source)
>         at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source)
>         at org.cyberneko.html.parsers.DOMFragmentParser.parse(Unknown Source)
>         at net.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:157)
>         at net.nutch.parse.ParserChecker.main(ParserChecker.java:74)
>
> "VM Thread" prio=1 tid=0x080b4f30 nid=0x47d6 runnable
>
> "VM Periodic Task Thread" prio=1 tid=0x080c75f8 nid=0x47dc waiting on condition
>
> Using the URL mentioned above, I was able to successfully parse the file
> using a normal NekoHTML DocumentParser.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
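
On the question above of how to find out which URL is being parsed when the hang occurs: one possible approach, sketched here under assumptions, is to run each parse on a separate thread with a time limit and log the URL when the limit is exceeded. This is not Nutch's own API. The class name ParseWatchdog, the method runWithTimeout, and the example URL are hypothetical, and the Callable merely stands in for the real parse call. Only java.util.concurrent (available since Java 5) is used.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Nutch: guards a parse call with a time limit. */
final class ParseWatchdog {

    private ParseWatchdog() {}

    /**
     * Runs the task on a daemon thread and waits at most timeoutMillis.
     * Returns the task's result, or null if the task timed out or the
     * waiting thread was interrupted. On timeout the URL is logged.
     */
    static <T> T runWithTimeout(String url, Callable<T> task, long timeoutMillis) {
        // Daemon thread: a parser stuck in a loop cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "parse-watchdog");
                t.setDaemon(true);
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts; code that ignores interrupts keeps running.
            future.cancel(true);
            System.err.println("Parse timed out after " + timeoutMillis + " ms: " + url);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            throw new RuntimeException("Parse failed for " + url, e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real parse call; sleeps longer than the time limit.
        String result = runWithTimeout("http://example.invalid/slow.html",
                new Callable<String>() {
                    public String call() throws Exception {
                        Thread.sleep(5000);
                        return "parsed";
                    }
                }, 200L);
        System.out.println("result = " + result);
    }
}
```

Note that a parser spinning in a loop that never checks for interruption will keep consuming CPU after the timeout. The sketch only lets the caller identify the URL and move on; it does not stop the runaway thread.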