Hi Dogacan,

I've uploaded the patch to Nutch-503.

http://issues.apache.org/jira/browse/NUTCH-503

Regards,

-vishal.

-----Original Message-----
From: Dogacan Güney [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 21, 2007 12:33 PM
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Found the bug in Generator when number of URLs is small

On 6/21/07, Vishal Shah <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I think I found the reason why the generator returns with an empty
> fetchlist for small fetchsizes.
>
> After the first job finishes running, the generator checks the following
> condition to see if it got an empty list:
>
> if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
>
> The third condition is incorrect here. In some cases, esp. for small
> fetchlists, the first partition might be empty, but some other partition(s)
> might contain urls. In this case, the Generator is incorrectly assuming that
> all partitions are empty by just looking at the first. This problem could
> also occur when all URLs in the fetchlist are from the same host (or from a
> very small number of hosts, or from a number of hosts that all map to a
> small number of partitions).
>
> I fixed this problem by replacing the following code:
>
> // check that we selected at least some entries ...
> SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(job, tempDir);
> if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
>   LOG.warn("Generator: 0 records selected for fetching, exiting ...");
>   LockUtil.removeLockFile(fs, lock);
>   fs.delete(tempDir);
>   return null;
> }
>
> With the following code:
>
> // check that we selected at least some entries ...
> SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(job, tempDir);
> boolean empty = true;
> if (readers != null && readers.length > 0) {
>   for (int num = 0; num < readers.length; num++) {
>     if (readers[num].next(new FloatWritable())) {
>       empty = false;
>       break;
>     }
>   }
> }
> if (empty) {
>   LOG.warn("Generator: 0 records selected for fetching, exiting ...");
>   LockUtil.removeLockFile(fs, lock);
>   fs.delete(tempDir);
>   return null;
> }
>
> This seems to do the trick.

Nice catch. Can you open a JIRA issue and attach a patch there?

>
> Regards,
>
> -vishal.
>

--
Dogacan Güney
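
For anyone reading this thread later, the difference between the two checks quoted above can be reproduced without a Hadoop setup. What follows is only a rough sketch, not the Generator code: plain Java lists stand in for the per-partition SequenceFile readers, and the class name, method names and example URLs are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the Generator's emptiness check.
// Each inner list plays the role of one output partition.
final class PartitionCheckSketch {

    // Mirrors the original condition: only partition 0 is inspected.
    static boolean firstPartitionLooksEmpty(List<List<String>> partitions) {
        return partitions == null
                || partitions.isEmpty()
                || partitions.get(0).isEmpty();
    }

    // Mirrors the patched condition: empty only if every partition is empty.
    static boolean allPartitionsEmpty(List<List<String>> partitions) {
        if (partitions == null) {
            return true;
        }
        for (List<String> partition : partitions) {
            if (!partition.isEmpty()) {
                return false; // found at least one selected entry
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // All URLs hash to the second partition, e.g. because they share a host.
        List<List<String>> partitions = Arrays.asList(
                Collections.<String>emptyList(),
                Arrays.asList("http://example.com/a", "http://example.com/b"));

        System.out.println("first-partition check says empty: "
                + firstPartitionLooksEmpty(partitions));
        System.out.println("all-partitions check says empty:  "
                + allPartitionsEmpty(partitions));
    }
}
```

With the first partition empty and the second one populated, the first-partition check reports an empty fetchlist while the all-partitions check does not, which matches the behaviour described in the quoted message.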