We have seen this before too. If it is the same problem, the cause is the regex URL filter. Comment out the -.*(/.+?)/.*?\1/.*?\1/ expression in the regex-urlfilter.txt file and it should resolve itself. Also search the forum for "Fetcher stops pushes cpu to 100%".
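
To see what that rule does before disabling it, here is a minimal sketch in plain Java. It is not Nutch code: the class and method names are made up for illustration, and it assumes the filter applies each rule with find(), so the pattern may match anywhere in the URL. The leading '-' in regex-urlfilter.txt only marks the rule as an exclusion and is not part of the regex. The rule rejects URLs in which the same path segment shows up three times, and the lazy quantifiers combined with backreferences may backtrack heavily on long URLs.

```java
import java.util.regex.Pattern;

class RepeatedSegmentCheck {
    // The expression from regex-urlfilter.txt without the leading '-'.
    static final Pattern REPEATED_SEGMENT =
        Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Assumption: the rule is applied with find(), not matches().
    static boolean isRejected(String url) {
        return REPEATED_SEGMENT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/index.html",   // no repeated segment
            "http://example.com/a/x/a/y/a/z"   // "/a" appears three times
        };
        for (String url : urls) {
            System.out.println(isRejected(url) + "  " + url);
        }
    }
}
```

To comment the rule out, put a '#' in front of that line in regex-urlfilter.txt.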

Dennis

Daniel Varela Santoalla wrote:

Hello all

I've seen this mentioned in the mailing list before but nobody provided a solution yet (or I didn't find it).

The problem is that this "deflateBytes" seems to hang for long periods (from minutes to more than an hour) making the whole crawling process really slow. I'm crawling a single domain from inside, so I want the process to be as quick as possible, and now it is taking 10+ hours. During all this "hung" time there is no apparent CPU usage by the java process.

Any ideas on how to proceed with this? It is quite annoying, especially since HtDig takes less than two hours to index the same content.

Otherwise we are quite happy with Nutch and impressed with all the features.

Regards
Daniel

------------------------------------------------------------------------

Full thread dump Java HotSpot(TM) Client VM (1.5.0_07-b03 mixed mode, sharing):

"fetcher6" prio=1 tid=0x084c1348 nid=0x2ea5 runnable [0x469f6000..0x469f6580]
        at java.util.zip.Deflater.deflateBytes(Native Method)
        at java.util.zip.Deflater.deflate(Deflater.java:284)
        - locked <0x4a08c228> (a java.util.zip.Deflater)
        at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:154)
        at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:114)
        at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72)
        - locked <0x4a08c208> (a java.util.zip.GZIPOutputStream)
        at org.apache.nutch.io.WritableUtils.writeCompressedByteArray(WritableUtils.java:53)
        at org.apache.nutch.protocol.Content.write(Content.java:81)
        at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
        - locked <0x4c85d838> (a org.apache.nutch.io.ArrayFile$Writer)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        - locked <0x4c85d838> (a org.apache.nutch.io.ArrayFile$Writer)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:278)
        - locked <0x4c85d800> (a org.apache.nutch.io.ArrayFile$Writer)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)

------------------------------------------------------------------------

[ All other fetchers are blocked on the "outputPage" method, waiting for the previous thread to free the lock ]
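
The situation in the note above, one thread compressing while it holds the writer lock and the others queued behind it, can be reproduced in isolation. This is only a sketch under assumptions: the class, the lock object and the counters are hypothetical and not taken from Nutch, and it uses Java 8 features for brevity. It gzips each page inside a synchronized block and records how many threads were ever inside that block at once.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.GZIPOutputStream;

class SharedWriterSketch {
    private final Object writerLock = new Object();          // stands in for the shared writer
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxInside = new AtomicInteger();

    // Compression happens while the lock is held, so callers are serialized.
    void outputPage(byte[] content) {
        synchronized (writerLock) {
            int now = inside.incrementAndGet();
            maxInside.accumulateAndGet(now, Math::max);
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                gzip.write(content);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            inside.decrementAndGet();
        }
    }

    // Starts the given number of threads and returns the highest number
    // of threads observed inside the synchronized block at the same time.
    static int run(int threads, int pagesPerThread) {
        SharedWriterSketch writer = new SharedWriterSketch();
        byte[] page = new byte[64 * 1024];
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                for (int p = 0; p < pagesPerThread; p++) {
                    writer.outputPage(page);
                }
            }, "fetcher" + i);
            pool[i].start();
        }
        try {
            for (Thread t : pool) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return writer.maxInside.get();
    }

    public static void main(String[] args) {
        System.out.println("max threads inside outputPage: " + run(8, 20));
    }
}
```

However many threads are started, only one at a time is ever inside the block, so a single slow write stalls every fetcher, which matches what the dump shows.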

------------------------------------------------------------------------

"fetcher7" prio=1 tid=0x084d85d0 nid=0x2ea6 waiting for monitor entry [0x46b79000..0x46b79600]
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:277)
        - waiting to lock <0x4c85d800> (a org.apache.nutch.io.ArrayFile$Writer)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)


