We have seen this before too. If it is the same problem, the cause is the regex URL filter. Comment out the -.*(/.+?)/.*?\1/.*?\1/ expression in the regex-urlfilter.txt file and the problem should go away. Also search the forum for "Fetcher stops pushes cpu to 100%".
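
For reference, a minimal sketch of the change. It assumes the expression appears in your regex-urlfilter.txt exactly as quoted above and that the file treats lines starting with # as comments; the wording of the explanatory comment is only illustrative.

```text
# Disabled while investigating the slow fetch (backreference pattern):
# -.*(/.+?)/.*?\1/.*?\1/
```

The leading - marks an exclusion rule, and this one appears intended to skip URLs in which a path segment repeats. With it disabled, such URLs will no longer be filtered out, so keep an eye out for crawl loops.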

Dennis

Daniel Varela Santoalla wrote:
>
> Hello all
>
> I've seen this mentioned in the mailing list before but nobody
> provided a solution yet (or I didn't find it).
>
> The problem is that this "deflateBytes" seems to hang for long periods
> (from minutes to more than an hour) making the whole crawling process
> really slow. I'm crawling a single domain from inside, so I want the
> process to be as quick as possible, and now it is taking 10+ hours.
> During all this "hung" time there is no apparent CPU usage by the java
> process.
>
> Any ideas on how to proceed with this? It is quite annoying, specially
> since HtDig takes less than two hours to index the same content.
>
> Otherwise we are quite happy with Nutch and impressed with all the
> features.
>
> Regards
> Daniel
>
> ------------------------------------------------------------------------
>
> Full thread dump Java HotSpot(TM) Client VM (1.5.0_07-b03 mixed mode, sharing):
>
> "fetcher6" prio=1 tid=0x084c1348 nid=0x2ea5 runnable [0x469f6000..0x469f6580]
>     at java.util.zip.Deflater.deflateBytes(Native Method)
>     at java.util.zip.Deflater.deflate(Deflater.java:284)
>     - locked <0x4a08c228> (a java.util.zip.Deflater)
>     at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:154)
>     at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:114)
>     at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72)
>     - locked <0x4a08c208> (a java.util.zip.GZIPOutputStream)
>     at org.apache.nutch.io.WritableUtils.writeCompressedByteArray(WritableUtils.java:53)
>     at org.apache.nutch.protocol.Content.write(Content.java:81)
>     at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
>     at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
>     - locked <0x4c85d838> (a org.apache.nutch.io.ArrayFile$Writer)
>     at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>     - locked <0x4c85d838> (a org.apache.nutch.io.ArrayFile$Writer)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:278)
>     - locked <0x4c85d800> (a org.apache.nutch.io.ArrayFile$Writer)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>
> ------------------------------------------------------------------------
>
> [ All other fetchers are blocked on the "outputPage" method, waiting
> for the previous thread to free the lock ]
>
> ------------------------------------------------------------------------
>
> "fetcher7" prio=1 tid=0x084d85d0 nid=0x2ea6 waiting for monitor entry [0x46b79000..0x46b79600]
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:277)
>     - waiting to lock <0x4c85d800> (a org.apache.nutch.io.ArrayFile$Writer)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
