Hello-

    I am almost certain I have found a nasty bug with nutch generate.

    Problem: Nutch generate can take many hours, even a day to complete (on a 
crawldb that has less than 2 million urls).

    I added debug code to Generator->Selector.map to see when map is called and 
returns, and observed interesting behavior, described here:

    1. Most of the time, when generate is run, urls are processed in chunky 
batches, usually about 40 at a time, followed by a 1 second delay.  I timed the 
delay, and it really is a 1 second delay (i.e., 30 batches took 30 seconds).  When 
this happens it takes hours to complete.

    2. Sometimes (randomly as far as I can tell) when I run nutch, the urls are 
processed without delays.  It is an all or nothing event, either I run and all 
urls process quickly without delay (in minutes), or more likely I get the 
chunky processing with many 1 second delays and the program takes hours to end. 
 The one exception is....

    3. When the processing runs quickly I've seen the main thread end (I have 
some profiling going, so I know when a thread ends), and then more likely than 
not a second thread begins where the first starts, chunky like usual.  Although 
I sometimes can get fast processing in one thread, it is almost impossible for 
me to get it in all threads, and therefore general processing is very slow 
(hours).

    4. I tried to put in more debug code to find the line where the delays 
occurred, but the last line printed to the log at a delay seemed random, leading 
me to believe that the log is not being flushed uniformly.

    5. The profiler I used seemed to imply that about 100% of the time was 
spent in java.lang.Thread.sleep.  I am not completely familiar with the 
profiler I used so I am not completely sure I interpreted this correctly.
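
    One way to cross-check the profiler reading in item 5 without relying on 
log flushing is to dump the stacks of any threads that are sitting in 
Thread.sleep at the moment of a delay; the frames under the sleep call show 
who asked for it.  Below is a minimal sketch of that idea using only the 
standard JDK call Thread.getAllStackTraces().  The class name SleepFinder and 
its methods are hypothetical helpers made up for illustration; they are not 
part of Nutch or Hadoop, and the sketch assumes it runs inside the same JVM 
as the slow task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical debugging helper (not part of Nutch or Hadoop).
public class SleepFinder {

    /** Returns one formatted stack trace per live thread that is inside Thread.sleep. */
    public static List<String> findSleepingThreads() {
        List<String> reports = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            if (!isSleeping(stack)) {
                continue;
            }
            StringBuilder sb = new StringBuilder();
            sb.append("Thread \"").append(e.getKey().getName()).append("\" is sleeping:\n");
            for (StackTraceElement frame : stack) {
                sb.append("    at ").append(frame).append('\n');
            }
            reports.add(sb.toString());
        }
        return reports;
    }

    /**
     * True if one of the top three frames is a java.lang.Thread method whose
     * name starts with "sleep". More than one frame is checked because some
     * JDK versions put an internal native sleep method above Thread.sleep.
     */
    static boolean isSleeping(StackTraceElement[] stack) {
        for (int i = 0; i < Math.min(3, stack.length); i++) {
            if ("java.lang.Thread".equals(stack[i].getClassName())
                    && stack[i].getMethodName().startsWith("sleep")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo only: start a thread that sleeps, then report it.
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(5000);
            } catch (InterruptedException ignored) {
                // exit quietly
            }
        }, "demo-sleeper");
        sleeper.setDaemon(true);
        sleeper.start();
        Thread.sleep(200); // give the demo thread time to enter sleep
        for (String report : findSleepingThreads()) {
            System.err.print(report); // System.err is typically unbuffered
        }
    }
}
```

    Calling findSleepingThreads() from the debug code whenever the gap between 
two map calls exceeds, say, half a second should show which caller is 
sleeping.  Running the JDK's jstack tool against the process id during a 
delay gives the same kind of information from outside the JVM.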

    I will keep debugging here, but perhaps someone here has some insight into 
what might be happening?

                        thanks
                            -J 
