[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509059
 ] 

Doğacan Güney commented on NUTCH-503:
-------------------------------------

>  I don't know how to write a test case to cover this particular bug. Any 
> thoughts?

Normally, you would extend TestGenerator: generate a couple of urls, then 
show that the first part is empty even though other parts contain urls (so 
nutch would fail this test case without your patch).

However, this bug only occurs in a distributed setup, and our test cases run 
in a single-machine setup (by default). Hadoop does have a class called 
MiniMRCluster which (I think) allows you to run distributed tests, but it 
comes from hadoop's test jar, which we don't have.

Since your patch is (hopefully :) obviously correct, we can skip writing a 
unit test for this one. But we really need some sort of mechanism to run our 
tests in a distributed setup.

> Generator exits incorrectly for small fetchlists 
> -------------------------------------------------
>
>                 Key: NUTCH-503
>                 URL: https://issues.apache.org/jira/browse/NUTCH-503
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: Fedora Core 2, JDK 1.6
>            Reporter: Vishal Shah
>             Fix For: 0.8.2
>
>         Attachments: emptyfetchlist.patch, emptyfetchlist.patch
>
>
>    I think I found the reason why the generator returns with an empty 
> fetchlist for small fetchsizes. 
>  
>    After the first job finishes running, the generator checks the following 
> condition to see if it got an empty list:
>  
>     if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
>  
>   The third condition is incorrect here. In some cases, esp. for small 
> fetchlists, the first partition might be empty, but some other partition(s) 
> might contain urls. In this case, the Generator is incorrectly assuming that 
> all partitions are empty by just looking at the first. This problem could 
> also occur when all URLs in the fetchlist are from the same host (or from a 
> very small number of hosts, or from a number of hosts that all map to a small 
> number of partitions).
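
The difference between checking only the first partition and checking all of 
them can be sketched as below. This is a minimal illustration, not the actual 
Generator code or the attached patch: the class EmptyFetchlistCheck and both 
method names are hypothetical, and each partition is modelled as a plain list 
of urls instead of a Hadoop reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: each inner list plays the role of one output partition.
class EmptyFetchlistCheck {

    // Mirrors the shape of the condition quoted in the issue:
    // only the first partition is inspected.
    static boolean looksEmptyFirstOnly(List<List<String>> parts) {
        return parts == null || parts.isEmpty() || parts.get(0).isEmpty();
    }

    // Alternative: the fetchlist counts as empty only if every partition is empty.
    static boolean isEmptyAllPartitions(List<List<String>> parts) {
        if (parts == null) {
            return true;
        }
        for (List<String> part : parts) {
            if (!part.isEmpty()) {
                return false; // at least one partition has a url
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // First partition empty, second one holds a url (the small-fetchlist case).
        List<List<String>> parts = Arrays.asList(
                new ArrayList<String>(),
                Arrays.asList("http://example.com/"));

        System.out.println("first-only check says empty: " + looksEmptyFirstOnly(parts));
        System.out.println("all-partitions check says empty: " + isEmptyAllPartitions(parts));
    }
}
```

With one empty partition followed by a non-empty one, the first-only check 
reports an empty fetchlist while the all-partitions check does not, which is 
the situation the issue describes.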
