Re: New to nutch, seem to be problems

misc Thu, 30 Aug 2007 12:25:39 -0700


Hello-

One more important piece of data about the problems that I am having. After waiting a really long time, I learned that fetch is not hung up, it was just reeeeeeeeally slow. It took only a few hours to go through all the urls (the corresponding lines for each url appear in the hadoop.log, and all the content was loaded). Then it took 24 hours of waiting before the phrase "fetcher done" appeared. Then fetch returned. Why would fetch hang after the crawl was done before returning?

Looking at the code, it would seem that some of the fetcher threads must be stuck for a long time. Don't these time out?
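
To make the timeout question concrete, here is a minimal sketch of waiting on worker threads with an overall deadline, so that a few stuck workers cannot hold up the caller indefinitely. It is a generic Python illustration and not Nutch's fetcher code; the function name `run_with_deadline` and its parameters are hypothetical.

```python
import threading
import time


def run_with_deadline(tasks, timeout_s):
    """Run each task in its own thread and wait at most timeout_s in total.

    Returns the indexes of tasks whose threads are still running when the
    deadline passes (the "stuck" ones). Hypothetical helper, not a Nutch API.
    """
    # Daemon threads do not keep the process alive once the caller gives up.
    threads = [threading.Thread(target=task, daemon=True) for task in tasks]
    for thread in threads:
        thread.start()

    deadline = time.monotonic() + timeout_s
    for thread in threads:
        # Each join gets only the time left until the shared deadline.
        remaining = max(0.0, deadline - time.monotonic())
        thread.join(remaining)

    return [i for i, thread in enumerate(threads) if thread.is_alive()]


if __name__ == "__main__":
    stuck = run_with_deadline([lambda: None, lambda: time.sleep(2)], 0.2)
    print("still running after deadline:", stuck)
```

Without a deadline of this kind, a plain join on every thread returns only when the slowest worker finishes, which would look like the behaviour described above.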


                       thanks

----- Original Message -----
From: "misc" <[EMAIL PROTECTED]>

To: <nutch-agent@lucene.apache.org>
Sent: Wednesday, August 29, 2007 5:31 PM
Subject: Re: New to nutch, seem to be problems

Hello-
I will reply to my own post with new findings and observations. About the slowness of generate, I just don't believe that it should take many hours to generate a list (of any size) from a database of a couple million entries. I could do the equivalent on plain text lists using grep, sort, and uniq in just minutes. I *must* be doing something wrong.
I dug into it today. Could someone correct me if I am wrong on any of this? I couldn't find any written information about this anywhere.
1. The generate seems to be broken into three phases, each a separate mapreduce command. The first phase runs through all the urls in the crawldb, and throws out any that aren't eligible for crawling (by crawl date).
2. The second phase partitions by hostname and ranks according to frequency. It also cuts out repeat requests to a host if the number is too high (set by a parameter), and then sorts the urls by frequency.
3. The third phase updates the database with the information that the url is being crawled and should not be handed out to anyone else.
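
The three phases as described above can be sketched in a few lines of single-process Python. This is only an illustration of the description in this message, not Nutch's actual mapreduce implementation. The record fields (`next_fetch`, `rank`, `generating`), the function name `generate`, and the parameters `top_n` and `max_per_host` are all assumptions made for the example; `rank` stands in for whatever value the real second phase sorts on.

```python
from collections import defaultdict
from urllib.parse import urlparse


def generate(crawldb, now, top_n, max_per_host):
    """Toy version of the three phases described above (not Nutch code).

    crawldb maps url -> {"next_fetch": float, "rank": float, "generating": bool}.
    """
    # Phase 1: drop urls that are not yet due or are already handed out.
    due = [
        url
        for url, rec in crawldb.items()
        if rec["next_fetch"] <= now and not rec["generating"]
    ]

    # Phase 2: best-ranked first, with a cap on urls per host.
    due.sort(key=lambda url: crawldb[url]["rank"], reverse=True)
    per_host = defaultdict(int)
    selected = []
    for url in due:
        host = urlparse(url).hostname
        if per_host[host] >= max_per_host:
            continue  # too many requests to this host already
        per_host[host] += 1
        selected.append(url)
        if len(selected) == top_n:
            break

    # Phase 3: record that these urls are being crawled.
    for url in selected:
        crawldb[url]["generating"] = True
    return selected


if __name__ == "__main__":
    db = {
        "http://a.example/1": {"next_fetch": 0, "rank": 3.0, "generating": False},
        "http://a.example/2": {"next_fetch": 0, "rank": 2.0, "generating": False},
        "http://b.example/1": {"next_fetch": 0, "rank": 1.0, "generating": False},
        "http://c.example/1": {"next_fetch": 100, "rank": 9.0, "generating": False},
    }
    print(generate(db, now=10, top_n=10, max_per_host=1))
```

Even in this naive form, phase 1 is a single linear pass over the database, which is the basis for expecting it to be quick on a couple million entries.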
By observing what was going on, I could see that the first phase seems to take a couple of hours. I can change the log level of nutch to debug and see all the rejected urls being generated, and it does seem to be slow, a couple per second. (My db has about 200k crawled things and about 2,000,000 uncrawled, so about 1 in 10 should be rejected. How can nutch only be going at a rate of about 20 per second? This is way too slow.)
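
As a back-of-the-envelope check on those numbers, the short calculation below takes the figures in this message at face value (about 200k crawled urls, about 2,000,000 uncrawled, roughly 20 urls per second, a first phase of roughly two hours). The variable names are illustrative only.

```python
crawled = 200_000
uncrawled = 2_000_000
total = crawled + uncrawled

# Time for one full pass at the estimated rate of 20 urls per second.
estimated_rate_per_s = 20
hours_at_estimated_rate = total / estimated_rate_per_s / 3600

# Rate that a two-hour pass over the whole database would imply.
observed_hours = 2
implied_rate_per_s = total / (observed_hours * 3600)

print(f"{hours_at_estimated_rate:.1f} hours at {estimated_rate_per_s} urls/s")
print(f"{implied_rate_per_s:.0f} urls/s implied by a {observed_hours}-hour pass")
```

The two figures do not agree (about 30 hours versus about 300 urls per second), so the 20-per-second estimate from watching the debug log is rough at best. Either way the phase is far slower than a linear scan of a couple million records should be.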
I also looked to see if DNS lookups were slowing me down, but as far as I can tell they are not. First, the first phase doesn't even do DNS, yet is slow; second, I used Wireshark to look for DNS lookups and found none.
Can someone tell me the expected time for generate to run? 6 hours is too long!
                       thanks
                           -J
----- Original Message -----
From: "misc" <[EMAIL PROTECTED]>
To: <nutch-agent@lucene.apache.org>
Sent: Tuesday, August 28, 2007 6:27 PM
Subject: New to nutch, seem to be problems


Hello-
My configuration and stats are at the end of this email. I have set up nutch to crawl 100,000 urls. The first pass (of 100,000 items) went well, but problems started after this.
1. Generate takes many hours to complete. It doesn't matter whether I generate 1 million or 1000 items, it takes about 5 hours to complete. Is this normal?
2. Fetch works great, until it is done. It then freezes up indefinitely. It can fetch 1000000 pages in about 12 hours, and all the fetched content is in /tmp, but then it just sits there, not returning to the command line. I have let it sit for about 12 hours and eventually broke down and cancelled it. If I try to update the database it of course fails.
3. Fetch2 runs very slowly. Even though I am using 80 threads, I only download one object every few seconds (1 every 5 or 10 seconds). From the log, I can see that almost always 79 or 80 threads are spinWaiting.
4. I can't tell if fetch2 freezes like fetch does, as I haven't been able to wait the many days it will take to go through a full fetch with fetch2.
Configuration:

   Core duo 2.4 GHz, 1 GB RAM, 750 GB hard drive.
The ethernet connection has a dedicated 1gb connection to the web, so certainly that isn't a problem.
   I have tested on nutch 0.9 and the newest daily build from 2007-08-28.
I seeded with urls from the opendirectory, 100000. I first ran a pass to load all 100000, then took the topN=1million (10 times larger than the first set of urls). The first pass had no problem, the second pass (and beyond) is where the problems began.
