Hi Dogacan,

I've uploaded the patch to NUTCH-503:

http://issues.apache.org/jira/browse/NUTCH-503


Regards,

-vishal.

-----Original Message-----
From: Dogacan Güney [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 21, 2007 12:33 PM
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Found the bug in Generator when number of URLs is small

On 6/21/07, Vishal Shah <[EMAIL PROTECTED]> wrote:
> Hi,
>
>    I think I found the reason why the generator returns with an empty
> fetchlist for small fetchsizes.
>
>    After the first job finishes running, the generator checks the following
> condition to see if it got an empty list:
>
>     if (readers == null || readers.length == 0
>         || !readers[0].next(new FloatWritable())) {
>
>   The third condition is incorrect here. In some cases, especially for small
> fetchlists, the first partition might be empty while some other partition(s)
> contain URLs. In that case, the Generator incorrectly assumes that all
> partitions are empty after looking only at the first. This problem could
> also occur when all URLs in the fetchlist are from the same host (or from a
> very small number of hosts, or from a number of hosts that all map to a
> small number of partitions).
>
>   I fixed this problem by replacing the following code:
>
>     // check that we selected at least some entries ...
>     SequenceFile.Reader[] readers =
>         SequenceFileOutputFormat.getReaders(job, tempDir);
>     if (readers == null || readers.length == 0
>         || !readers[0].next(new FloatWritable())) {
>       LOG.warn("Generator: 0 records selected for fetching, exiting ...");
>       LockUtil.removeLockFile(fs, lock);
>       fs.delete(tempDir);
>       return null;
>     }
>
> With the following code:
>
>     // check that we selected at least some entries ...
>     SequenceFile.Reader[] readers =
>         SequenceFileOutputFormat.getReaders(job, tempDir);
>     boolean empty = true;
>     if (readers != null && readers.length > 0) {
>       for (int num = 0; num < readers.length; num++) {
>         if (readers[num].next(new FloatWritable())) {
>           empty = false;
>           break;
>         }
>       }
>     }
>     if (empty) {
>       LOG.warn("Generator: 0 records selected for fetching, exiting ...");
>       LockUtil.removeLockFile(fs, lock);
>       fs.delete(tempDir);
>       return null;
>     }
>
> This seems to do the trick.

Nice catch. Can you open a JIRA issue and attach a patch there?
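
For anyone following along, here is a minimal standalone sketch of the
difference between the two checks. It does not use the Hadoop API at all:
each partition is modelled as a plain list of URLs, and the class and method
names (EmptyCheckSketch, emptyByFirstPartition, emptyByAnyPartition) are
made up for illustration only, not taken from the Generator.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyCheckSketch {

    // Mirrors the old condition: only partition 0 is inspected.
    static boolean emptyByFirstPartition(List<List<String>> partitions) {
        return partitions == null
            || partitions.isEmpty()
            || partitions.get(0).isEmpty();
    }

    // Mirrors the patched condition: the list is empty only if
    // every partition is empty.
    static boolean emptyByAnyPartition(List<List<String>> partitions) {
        if (partitions == null) {
            return true;
        }
        for (List<String> partition : partitions) {
            if (!partition.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical fetchlist: all URLs hashed to partition 1,
        // so partition 0 has no entries.
        List<List<String>> partitions = Arrays.asList(
            Collections.<String>emptyList(),
            Arrays.asList("http://example.com/a", "http://example.com/b"));

        System.out.println("first-partition check says empty: "
            + emptyByFirstPartition(partitions));
        System.out.println("all-partitions check says empty:  "
            + emptyByAnyPartition(partitions));
    }
}
```

With one empty partition followed by a non-empty one, the first check
reports an empty fetchlist while the second does not, which is the
situation described above for small fetchlists.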

>
> Regards,
>
> -vishal.
>


-- 
Dogacan Güney
