On 6/21/07, Vishal Shah <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I think I found the reason why the generator returns with an empty
> fetchlist for small fetchsizes.
>
> After the first job finishes running, the generator checks the following
> condition to see if it got an empty list:
>
>   if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
>
> The third condition is incorrect here. In some cases, esp. for small
> fetchlists, the first partition might be empty, but some other partition(s)
> might contain urls. In this case, the Generator is incorrectly assuming that
> all partitions are empty by just looking at the first. This problem could
> also occur when all URLs in the fetchlist are from the same host (or from a
> very small number of hosts, or from a number of hosts that all map to a
> small number of partitions).
>
> I fixed this problem by replacing the following code:
>
>   // check that we selected at least some entries ...
>   SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(job, tempDir);
>   if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
>     LOG.warn("Generator: 0 records selected for fetching, exiting ...");
>     LockUtil.removeLockFile(fs, lock);
>     fs.delete(tempDir);
>     return null;
>   }
>
> With the following code:
>
>   // check that we selected at least some entries ...
>   SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(job, tempDir);
>   boolean empty = true;
>   if (readers != null && readers.length > 0) {
>     for (int num = 0; num < readers.length; num++) {
>       if (readers[num].next(new FloatWritable())) {
>         empty = false;
>         break;
>       }
>     }
>   }
>   if (empty) {
>     LOG.warn("Generator: 0 records selected for fetching, exiting ...");
>     LockUtil.removeLockFile(fs, lock);
>     fs.delete(tempDir);
>     return null;
>   }
>
> This seems to do the trick.

Nice catch. Can you open a JIRA issue and attach a patch there?

>
> Regards,
>
> -vishal.
>

--
Doğacan Güney

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
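
For readers of the archive, the difference between the two checks discussed in this thread can be reproduced without Hadoop. The sketch below is only an illustration, not the Nutch code: it stands in plain Java lists for the per-partition `SequenceFile.Reader` instances, and the class name `EmptyFetchlistCheck`, the methods `isEmptyFirstOnly` and `isEmptyAll`, and the example URLs are all hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative stand-in: each inner list plays the role of one output
// partition. None of these names come from Nutch or Hadoop.
public class EmptyFetchlistCheck {

    // Mirrors the original condition: only the first partition is inspected.
    static boolean isEmptyFirstOnly(List<List<String>> partitions) {
        return partitions == null
            || partitions.isEmpty()
            || partitions.get(0).isEmpty();
    }

    // Mirrors the proposed fix: the fetchlist counts as empty only if
    // every partition is empty.
    static boolean isEmptyAll(List<List<String>> partitions) {
        if (partitions == null) {
            return true;
        }
        for (List<String> partition : partitions) {
            if (!partition.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // First partition is empty, second holds URLs from a single host.
        List<List<String>> partitions = Arrays.asList(
            Collections.<String>emptyList(),
            Arrays.asList("http://example.com/a", "http://example.com/b"));

        System.out.println("first-only check says empty: " + isEmptyFirstOnly(partitions));
        System.out.println("all-partitions check says empty: " + isEmptyAll(partitions));
    }
}
```

With an empty first partition and a non-empty second one, the first-only check reports the fetchlist as empty while the all-partitions check does not, which is the situation described in the quoted message.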