segment size is never as big as topN or crawlDB size in a distributed 
deployment
---------------------------------------------------------------------------------

         Key: NUTCH-246
         URL: http://issues.apache.org/jira/browse/NUTCH-246
     Project: Nutch
        Type: Bug

    Versions: 0.8-dev    
    Reporter: Stefan Groschupf
    Priority: Blocker
     Fix For: 0.8-dev


I didn't reopen NUTCH-136 since this may be related to the Hadoop split.
I tested this on two different deployments (one with 10 tasktrackers + 1 
jobtracker and one with 9 tasktrackers + 1 jobtracker).
Defining the number of map and reduce tasks in a mapred-default.xml (which is 
in nutch/conf on all boxes) does not solve the problem.
We verified that it is not a problem of the maximum URLs per host, and also 
not a problem of the URL filter.

It looks like the first job of the Generator (Selector) already gets too few 
entries to process. 
Maybe this is somehow related to split generation or to configuration inside 
the distributed jobtracker, since it runs in a different JVM than the jobclient.
However, we were not able to find the source of this problem.

I think this should be fixed before publishing Nutch 0.8.






