Re: bug with generate performance

Doğacan Güney Fri, 07 Sep 2007 00:38:21 -0700

Hi,

On 8/31/07, misc <[EMAIL PROTECTED]> wrote:
>
> Hello-
>
>     I am almost certain I have found a nasty bug with nutch generate.
>
>     Problem: Nutch generate can take many hours, even a day to complete (on a 
> crawldb that has less than 2 million urls).
>
>     I added debug code to Generator->Selector.map to see when map is called 
> and returns, and observed interesting behavior, described here:
>
>     1. Most of the time, when generate is run, urls are processed in chunky 
> batches, usually about 40 at a time, followed by a 1 second delay.  I timed 
> the delay, and it really is a 1 second delay (i.e., 30 batches took 30 
> seconds).  When this happens it takes hours to complete.
>
>     2. Sometimes (randomly as far as I can tell) when I run nutch, the urls 
> are processed without delays.  It is an all or nothing event, either I run 
> and all urls process quickly without delay (in minutes), or more likely I get 
> the chunky processing with many 1 second delays and the program takes hours 
> to end.  The one exception is....
>
>     3. When the processing runs quickly I've seen the main thread end (I have 
> some profiling going, so I know when a thread ends), and then more likely 
> than not a second thread begins where the first starts, chunky like usual.  
> Although I sometimes can get fast processing in one thread, it is almost 
> impossible for me to get it in all threads and therefore general processing 
> is very slow (hours).
>
>     4. I tried to put in more debug code to find the line where the delays 
> occurred, but the last line printed to the log at a delay seemed random, 
> leading me to believe that the log is not being flushed uniformly.
>
>     5. The profiler I used seemed to imply that about 100% of the time was 
> spent in java.lang.Thread.sleep.  I am not completely familiar with the 
> profiler I used so I am not completely sure I interpreted this correctly.
>
>     I will keep debugging here, but perhaps someone here has some insight 
> into what might be happening?


Others have also reported a problem with generate performance. It
seems we have a problem here, but I cannot reproduce this behaviour, so
I am not sure what causes it. Can you open a JIRA issue and enter your
comments there? Details of how you are running generate would also be
very helpful (what is generate.max.per.host? what is the -topN
argument? etc.)
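
If it helps narrow down points 4 and 5, something along these lines might
show which thread is actually sleeping while map is stalled. This is only a
rough sketch and not Nutch code: the StallDetector class, its method names
and the threshold are all made up for illustration. The idea is to call
touch() at the top of Selector.map and let a small watchdog thread print
(and flush) the stacks of all TIMED_WAITING threads whenever no call has
arrived for longer than the threshold.

```java
import java.util.Map;

/** Hypothetical debugging helper (not part of Nutch). */
public class StallDetector {
    private final long thresholdMillis;
    // Time of the most recent touch(); -1 means "never called yet".
    private volatile long lastCallMillis = -1;

    public StallDetector(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Record that the monitored method was just entered. */
    public void touch(long nowMillis) {
        lastCallMillis = nowMillis;
    }

    /** True if the last touch() is at least thresholdMillis old. */
    public boolean isStalled(long nowMillis) {
        long last = lastCallMillis;
        return last >= 0 && nowMillis - last >= thresholdMillis;
    }

    /** Stack traces of all threads in TIMED_WAITING (e.g. inside Thread.sleep). */
    public static String sleepingThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            if (e.getKey().getState() != Thread.State.TIMED_WAITING) {
                continue;
            }
            sb.append(e.getKey().getName()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Start a daemon thread that reports stalls, at most maxDumps times. */
    public Thread startWatchdog(final int maxDumps) {
        Thread t = new Thread(() -> {
            int dumps = 0;
            try {
                while (dumps < maxDumps) {
                    Thread.sleep(Math.max(1, thresholdMillis / 2));
                    if (isStalled(System.currentTimeMillis())) {
                        System.err.println("no call for " + thresholdMillis
                                + " ms; sleeping threads:\n" + sleepingThreads());
                        System.err.flush(); // avoid losing output to buffering
                        dumps++;
                    }
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }, "stall-watchdog");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

With a threshold of, say, 500 ms, you would create one detector, call
startWatchdog(20) once, and call touch(System.currentTimeMillis()) at the
start of each map call. The dumps are taken during the delay rather than
after it, so they do not depend on when the regular log gets flushed.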

>
>                         thanks
>                             -J


-- 
Doğacan Güney
