We have seen this before too. If it is the same problem, the cause is the
regex URL filter. Comment out the -.*(/.+?)/.*?\1/.*?\1/ expression in the
regex-urlfilter.txt file and it should resolve itself. Also search the
forum for "Fetcher stops pushes cpu to 100%".
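
For anyone wondering what that expression does: the leading "-" marks it as
an exclude rule, and the rest matches URLs in which the same path segment
shows up three times, which is meant to catch crawler loops. Below is a
minimal sketch using java.util.regex. The class and method names are
hypothetical, and it assumes the filter applies the pattern with find-style
matching, so it stands in for the real filter code rather than reproducing it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: applies the expression from
// regex-urlfilter.txt (without the leading "-" rule marker) to a URL.
final class LoopRuleSketch {

    // (/.+?) captures one path segment; the two \1 back-references require
    // that same segment to appear two more times later in the URL.
    private static final Pattern LOOP_RULE =
            Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Assumption: find-style matching, i.e. a match anywhere in the URL counts.
    static boolean matchesLoopRule(String url) {
        return LOOP_RULE.matcher(url).find();
    }
}
```

With this sketch, "http://example.com/a/b/a/b/a/b/" matches the rule and
"http://example.com/a/b/c/" does not. The nested lazy quantifiers combined
with back-references force the regex engine to try many ways of splitting
the URL, so the cost can grow quickly on long URLs.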
Dennis
Daniel Varela Santoalla wrote:
Hello all
I've seen this mentioned on the mailing list before, but nobody has
provided a solution yet (or I didn't find it).
The problem is that the "deflateBytes" call seems to hang for long periods
(from minutes to more than an hour), making the whole crawling process
really slow. I'm crawling a single domain from inside, so I want the
process to be as quick as possible, and now it is taking 10+ hours.
During all this "hung" time there is no apparent CPU usage by the java
process.
Any ideas on how to proceed with this? It is quite annoying, especially
since HtDig takes less than two hours to index the same content.
Otherwise we are quite happy with Nutch and impressed with all the
features.
Regards
Daniel
------------------------------------------------------------------------
Full thread dump Java HotSpot(TM) Client VM (1.5.0_07-b03 mixed mode, sharing):
"fetcher6" prio=1 tid=0x084c1348 nid=0x2ea5 runnable [0x469f6000..0x469f6580]
    at java.util.zip.Deflater.deflateBytes(Native Method)
    at java.util.zip.Deflater.deflate(Deflater.java:284)
    - locked <0x4a08c228> (a java.util.zip.Deflater)
    at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:154)
    at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:114)
    at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72)
    - locked <0x4a08c208> (a java.util.zip.GZIPOutputStream)
    at org.apache.nutch.io.WritableUtils.writeCompressedByteArray(WritableUtils.java:53)
    at org.apache.nutch.protocol.Content.write(Content.java:81)
    at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
    at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
    - locked <0x4c85d838> (a org.apache.nutch.io.ArrayFile$Writer)
    at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
    - locked <0x4c85d838> (a org.apache.nutch.io.ArrayFile$Writer)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:278)
    - locked <0x4c85d800> (a org.apache.nutch.io.ArrayFile$Writer)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
------------------------------------------------------------------------
[ All other fetchers are blocked on the "outputPage" method, waiting
for the previous thread to free the lock ]
------------------------------------------------------------------------
"fetcher7" prio=1 tid=0x084d85d0 nid=0x2ea6 waiting for monitor entry [0x46b79000..0x46b79600]
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:277)
    - waiting to lock <0x4c85d800> (a org.apache.nutch.io.ArrayFile$Writer)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
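
The dump above has a recognizable shape: one thread holds the monitor of a
shared writer while it gzips a page, and every other fetcher waits for that
monitor. The sketch below reproduces that shape in isolation, next to a
variant that compresses before taking the lock. All names here are
hypothetical and none of this is Nutch code; whether the real writer could
safely be restructured this way is not something the dump establishes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for a writer shared by several fetcher threads.
final class SharedWriterSketch {
    private final List<byte[]> records = new ArrayList<>();

    // Same shape as the dump: the monitor is held for the whole compression,
    // so a slow deflate call stalls every thread waiting to append.
    synchronized void appendCompressingUnderLock(byte[] raw) {
        records.add(gzip(raw));
    }

    // Alternative shape: compress first, then hold the monitor only for the
    // short in-memory append.
    void appendCompressingOutsideLock(byte[] raw) {
        byte[] compressed = gzip(raw);
        synchronized (this) {
            records.add(compressed);
        }
    }

    synchronized int size() {
        return records.size();
    }

    // Gzips a byte array in memory; closing the stream flushes the trailer.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Both methods store the same compressed bytes; the only difference is how
long the shared monitor is held per page.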