On 6/21/07, Vishal Shah <[EMAIL PROTECTED]> wrote:
> Hi,
>
>    I think I found the reason why the generator returns with an empty
> fetchlist for small fetchsizes.
>
>    After the first job finishes running, the generator checks the following
> condition to see if it got an empty list:
>
>     if (readers == null || readers.length == 0
>         || !readers[0].next(new FloatWritable())) {
>
>   The third condition is incorrect. In some cases, especially for small
> fetchlists, the first partition might be empty while other partitions
> contain URLs. The Generator then incorrectly assumes that all partitions
> are empty after looking only at the first one. The same problem can occur
> when all URLs in the fetchlist come from the same host (or from a very
> small number of hosts, or from hosts that all map to a small number of
> partitions).
>
>   I fixed this problem by replacing the following code:
>
>     // check that we selected at least some entries ...
>     SequenceFile.Reader[] readers =
>         SequenceFileOutputFormat.getReaders(job, tempDir);
>     if (readers == null || readers.length == 0
>         || !readers[0].next(new FloatWritable())) {
>       LOG.warn("Generator: 0 records selected for fetching, exiting ...");
>       LockUtil.removeLockFile(fs, lock);
>       fs.delete(tempDir);
>       return null;
>     }
>
> With the following code:
>
>     // check that we selected at least some entries ...
>     SequenceFile.Reader[] readers =
>         SequenceFileOutputFormat.getReaders(job, tempDir);
>     boolean empty = true;
>     if (readers != null && readers.length > 0) {
>       for (int num = 0; num < readers.length; num++) {
>         if (readers[num].next(new FloatWritable())) {
>           empty = false;
>           break;
>         }
>       }
>     }
>     if (empty) {
>       LOG.warn("Generator: 0 records selected for fetching, exiting ...");
>       LockUtil.removeLockFile(fs, lock);
>       fs.delete(tempDir);
>       return null;
>     }
>
> This seems to do the trick.

Nice catch. Can you open a JIRA issue and attach a patch there?
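
For anyone following the thread, the failure mode can be reproduced without
Hadoop. The sketch below is only an illustration under stated assumptions:
plain in-memory lists stand in for the per-partition SequenceFile readers,
and the class and method names (PartitionCheck, firstPartitionHasRecords,
anyPartitionHasRecords) are made up for this example and are not part of
Nutch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the Generator's emptiness check.
// Each inner list plays the role of one partition's records.
final class PartitionCheck {

    // Mirrors the original condition: only partition 0 is inspected.
    static boolean firstPartitionHasRecords(List<List<String>> partitions) {
        return partitions != null
            && !partitions.isEmpty()
            && !partitions.get(0).isEmpty();
    }

    // Mirrors the proposed fix: scan partitions until one has a record.
    static boolean anyPartitionHasRecords(List<List<String>> partitions) {
        if (partitions == null) {
            return false;
        }
        for (List<String> partition : partitions) {
            if (!partition.isEmpty()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Partition 0 is empty; all URLs landed in partition 1.
        List<List<String>> partitions = Arrays.asList(
            Collections.<String>emptyList(),
            Arrays.asList("http://example.com/a", "http://example.com/b"));

        System.out.println("first-partition check: "
            + firstPartitionHasRecords(partitions));
        System.out.println("all-partitions check:  "
            + anyPartitionHasRecords(partitions));
    }
}
```

With an empty first partition and a populated second one, the first check
reports no records while the second finds them, which matches the behaviour
described above.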

>
> Regards,
>
> -vishal.
>


-- 
Doğacan Güney