Hi,

On 8/31/07, misc <[EMAIL PROTECTED]> wrote:
>
> Hello-
>
> I am almost certain I have found a nasty bug with nutch generate.
>
> Problem: Nutch generate can take many hours, even a day to complete (on a
> crawldb that has less than 2 million urls).
>
> I added debug code to Generator->Selector.map to see when map is called
> and returns, and observed interesting behavior, described here:
>
> 1. Most of the time, when generate is run urls are processed in chunky
> batches, usually about 40 at a time, followed by a 1 second delay. I timed
> the delay, and it really is a 1 second delay (ie- 30 batches was 30 seconds.)
> When this happens it takes hours to complete.
>
> 2. Sometimes (randomly as far as I can tell) when I run nutch, the urls
> are processed without delays. It is an all or nothing event, either I run
> and all urls process quickly without delay (in minutes), or more likely I get
> the chunky processing with many 1 second delays and the program takes hours
> to end. The one exception is....
>
> 3. When the processing runs quickly I've seen the main thread end (I have
> some profiling going, so I know when a thread ends), and then more likely
> than not a second thread begins where the first starts, chunky like usual.
> Although I sometimes can get fast processing in one thread, it is almost
> impossible for me to get it in all threads and therefore general processing
> is very slow (hours).
>
> 4. I tried to put in more debug code to find the line where the delays
> occurred, but the last line printed to the log at a delay seemed random,
> leading me to believe that the log is not being flushed uniformly.
>
> 5. The profiler I used seemed to imply that about 100% of the time was
> spent in java.lang.Thread.sleep. I am not completely familiar with the
> profiler I used so I am not completely sure I interpreted this correctly.
>
> I will keep debugging here, but perhaps someone here has some insight
> into what might be happening?
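
For the timing described in point 1, a small helper along these lines might make the pauses easier to pin down. This is only a sketch: CallGapTracker is a hypothetical class, not part of Nutch, and it assumes it is called by hand at the top of Selector.map. It flushes after every report, which may also help with the unflushed-log problem in point 4.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical debugging helper (not part of Nutch): reports pauses between calls. */
class CallGapTracker {
    private final long thresholdMillis;          // smallest gap worth reporting
    private long lastCallMillis = -1;            // -1 means "no call seen yet"
    private long callsSinceLastGap = 0;          // size of the current batch
    private final List<String> reports = new ArrayList<>();

    CallGapTracker(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Record one call; in real use pass System.currentTimeMillis(). */
    synchronized void recordCall(long nowMillis) {
        if (lastCallMillis >= 0) {
            long gap = nowMillis - lastCallMillis;
            if (gap >= thresholdMillis) {
                String msg = "gap of " + gap + " ms after " + callsSinceLastGap + " calls";
                reports.add(msg);
                System.err.println(msg);
                System.err.flush();              // make sure the line is not left in a buffer
                callsSinceLastGap = 0;           // a new batch starts with this call
            }
        }
        callsSinceLastGap++;
        lastCallMillis = nowMillis;
    }

    /** Copy of all gap reports so far. */
    synchronized List<String> getReports() {
        return new ArrayList<>(reports);
    }
}
```

With a threshold of 1000, a run that behaves as described (batches of about 40 followed by a one-second pause) should produce one report per batch, and a fast run should produce none.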

Others have also reported a problem with generate performance. It seems we
have a problem here, but I cannot reproduce this behaviour, so I am not sure
what causes it. Can you open a JIRA issue and enter your comments there?
Also, details of how you are running generate would be very helpful (what is
generate.max.per.host set to? what -topN argument are you passing? etc.)

>
> thanks
> -J

-- 
Doğacan Güney
