Hi Daniel,

I've seen this mentioned on the mailing list before, but nobody has provided a solution yet (or I didn't find it).

The problem is that this "deflateBytes" seems to hang for long periods (from minutes to more than an hour) making the whole crawling process really slow. I'm crawling a single domain from inside, so I want the process to be as quick as possible, and now it is taking 10+ hours. During all this "hung" time there is no apparent CPU usage by the java process.

Any ideas on how to proceed with this? It is quite annoying, especially since HtDig takes less than two hours to index the same content.

We ran into something that was perhaps similar with Nutch 0.7, where it seemed like the problem was a combination of (a) really slow sites sending us (b) really big, compressed archive files.

Our solution, which we didn't positively verify, was to limit the max size of downloads to 10MB, and to terminate slow fetcher threads.
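To illustrate the size cap, here is a minimal sketch in plain Java. It is not the Nutch code we changed; the class name CappedDownload, the method readCapped, and the constant MAX_BYTES are hypothetical names made up for this example. It simply stops reading once the cap is reached, so the result may be truncated.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of Nutch.
final class CappedDownload {
    /** Assumed cap, mirroring the 10MB limit mentioned above. */
    static final int MAX_BYTES = 10 * 1024 * 1024;

    /**
     * Reads at most maxBytes from the stream and returns what was read.
     * Bytes beyond the cap are left unread, so the result may be truncated.
     */
    static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never ask for more than the remaining allowance.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) {
                break; // end of stream before hitting the cap
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

A caller would pass the response body stream and MAX_BYTES, then decide what to do with content that came back exactly at the cap.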

Limiting the max size does have the side effect of triggering errors during processing for PDFs, archives, etc. that require a complete set of data to handle, unlike text documents such as HTML pages. It would be nice to be able to have a per-MIME-type size limit with a flag indicating whether to immediately abort the fetch if the server reports the size and this is bigger than the MIME-type limit.
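A rough sketch of what such a per-MIME-type policy could look like follows. Nothing like this exists in Nutch as far as this thread says; MimeSizePolicy, addRule, limitFor and shouldAbort are all hypothetical names, and the sketch assumes the caller passes the bare MIME type (no "; charset=..." parameters) and the Content-Length value, or -1 when the server did not report one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical policy class for illustration only; not part of Nutch.
final class MimeSizePolicy {
    private static final class Rule {
        final long maxBytes;
        final boolean abortIfLarger;

        Rule(long maxBytes, boolean abortIfLarger) {
            this.maxBytes = maxBytes;
            this.abortIfLarger = abortIfLarger;
        }
    }

    private final Map<String, Rule> rules = new HashMap<String, Rule>();
    private final Rule fallback;

    /** Types without their own rule get this limit and are never aborted up front. */
    MimeSizePolicy(long defaultMaxBytes) {
        this.fallback = new Rule(defaultMaxBytes, false);
    }

    void addRule(String mimeType, long maxBytes, boolean abortIfLarger) {
        rules.put(mimeType, new Rule(maxBytes, abortIfLarger));
    }

    private Rule ruleFor(String mimeType) {
        Rule rule = rules.get(mimeType);
        return rule != null ? rule : fallback;
    }

    /** Maximum number of bytes to read for this MIME type. */
    long limitFor(String mimeType) {
        return ruleFor(mimeType).maxBytes;
    }

    /**
     * True only when the rule asks for an abort and the server reported
     * a size (reportedLength >= 0) that exceeds the limit for this type.
     */
    boolean shouldAbort(String mimeType, long reportedLength) {
        Rule rule = ruleFor(mimeType);
        return rule.abortIfLarger
            && reportedLength >= 0
            && reportedLength > rule.maxBytes;
    }
}
```

With a rule such as addRule("application/pdf", 20L * 1024 * 1024, true), a PDF reported as 30MB would be skipped outright instead of being truncated and then failing in the parser, while HTML would still be read up to the default limit.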

The fetcher thread termination wound up causing a lot of headaches - we had to add some checks for termination in a few loops, and there was at least one common case in the Jakarta httpclient code where it could get very unhappy if it was interrupted.
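The kind of termination check we mean looks roughly like the sketch below. This is not the actual Nutch or httpclient code; InterruptibleCopy and drain are hypothetical names. Note that the check only runs between reads: with classic java.io socket streams, a read that is already blocked is generally not woken by an interrupt, so a socket timeout is still needed for that case.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;

// Hypothetical helper for illustration only; not part of Nutch.
final class InterruptibleCopy {
    /**
     * Reads the stream to the end and returns the number of bytes read.
     * Before each read it checks the thread's interrupt flag and gives up
     * with an InterruptedIOException if termination was requested.
     */
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        while (true) {
            // isInterrupted() does not clear the flag, so callers can still see it.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException(
                    "fetch cancelled after " + total + " bytes");
            }
            int n = in.read(buf);
            if (n < 0) {
                return total; // end of stream
            }
            total += n;
        }
    }
}
```

A watchdog thread can then call interrupt() on a fetcher that has been running too long, and the loop exits at the next check instead of running to completion.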

-- Ken


------------------------------------------------------------------------

Full thread dump Java HotSpot(TM) Client VM (1.5.0_07-b03 mixed mode, sharing):

"fetcher6" prio=1 tid=0x084c1348 nid=0x2ea5 runnable [0x469f6000..0x469f6580]
        at java.util.zip.Deflater.deflateBytes(Native Method)
        at java.util.zip.Deflater.deflate(Deflater.java:284)
        - locked <0x4a08c228> (a java.util.zip.Deflater)
        at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:154)
        at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:114)
        at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72)
        - locked <0x4a08c208> (a java.util.zip.GZIPOutputStream)
        at org.apache.nutch.io.WritableUtils.writeCompressedByteArray(WritableUtils.java:53)
        at org.apache.nutch.protocol.Content.write(Content.java:81)
        at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
        - locked <0x4c85d838> (a org.apache.nutch.io.ArrayFile$Writer)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        - locked <0x4c85d838> (a org.apache.nutch.io.ArrayFile$Writer)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:278)
        - locked <0x4c85d800> (a org.apache.nutch.io.ArrayFile$Writer)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)

------------------------------------------------------------------------

[ All other fetchers are blocked on the "outputPage" method, waiting for the previous thread to free the lock ]

------------------------------------------------------------------------

"fetcher7" prio=1 tid=0x084d85d0 nid=0x2ea6 waiting for monitor entry [0x46b79000..0x46b79600]
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:277)
        - waiting to lock <0x4c85d800> (a org.apache.nutch.io.ArrayFile$Writer)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)



--

Daniel Varela Santoalla
European Centre for Medium-Range Weather Forecasts (ECMWF) (http://www.ecmwf.int)


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
