[Nutch-dev] [jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

Chris Schneider (JIRA) Tue, 11 Apr 2006 08:14:08 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374049 ]


Chris Schneider commented on NUTCH-246:
---------------------------------------

A few more details:

Stefan and I were able to reproduce this problem using either an injection set 
of 4500 URLs or a larger set of DMOZ URLs. With the 4500 URL injection, only 
653 URLs were generated for the first segment, despite the fact that topN was 
set to 500K. I confirmed that nearly all of the 4500 injected URLs passed our 
URL filter and were actually injected into the crawldb.

To eliminate the possibility that the bug had been fixed recently or was due to 
a code modification that we'd made ourselves, we deployed yesterday's sandbox 
version of nutch (2006-04-10), including hadoop-0.1.1.jar (though I believe 
that Stefan had to build it himself because the nutch-0.8-dev.jar didn't match 
the source). We made the absolute minimum changes to nutch-site.xml, 
hadoop-site.xml, and hadoop-env.sh in order to deploy this version properly in 
our cluster (1 jobtracker/namenode machine, 10 tasktracker/datanode machines). 
However, we got the same results (i.e., very few URLs actually generated).

This bug has apparently been present since at least change 382948, but I 
suspect that it may have been present for the entire history of the mapreduce 
implementation of Nutch. It may also be the root cause of NUTCH-136, the 
explanation for which has always left me somewhat dissatisfied. Just because a 
nutch-site.xml containing default properties may override the desired mapred 
properties (incorrectly) specified in one of the *-default.xml files, and may 
therefore set mapred.map.tasks and mapred.reduce.tasks back to the defaults (2 
and 1, respectively), it's not clear to me exactly how/why you'd get only a 
fraction of topN URLs fetched. As Stefan has suggested, it would actually seem 
more plausible if each tasktracker tried to fetch the entire set of URLs in 
this case.
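
To make the NUTCH-136 explanation concrete, here is a minimal sketch of the override it describes. It assumes that configuration resources are read in order and that a later resource's value replaces an earlier one; the resolve() helper, the load order shown, and the desired values of 40 and 10 are hypothetical illustrations, not taken from Nutch or hadoop. Only the defaults of 2 and 1 come from the discussion above.

```python
# Hypothetical model of layered configuration lookup (not the hadoop API):
# resources are read in order, and a later value replaces an earlier one.
def resolve(layers):
    merged = {}
    for _name, props in layers:
        merged.update(props)
    return merged


# 40 and 10 are illustrative "desired" values only; 2 and 1 are the
# defaults mentioned in the discussion. The load order is an assumption.
layers = [
    ("mapred-default.xml", {"mapred.map.tasks": 40, "mapred.reduce.tasks": 10}),
    ("nutch-site.xml", {"mapred.map.tasks": 2, "mapred.reduce.tasks": 1}),
]

effective = resolve(layers)
print(effective)
```

Under these assumptions the job would run with 2 map tasks and 1 reduce task. That changes the degree of parallelism, but by itself it does not explain why fewer than topN URLs would be generated.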

I would suggest that someone with a good understanding of the hadoop 
implementation investigate the first generation job in fine detail, both for 
the case where the mapred properties are specified in an appropriate manner and 
for the case where nutch-site.xml overrides the desired properties, setting 
them back to the defaults.

> segment size is never as big as topN or crawlDB size in a distributed 
> deployement
> ---------------------------------------------------------------------------------
>
>          Key: NUTCH-246
>          URL: http://issues.apache.org/jira/browse/NUTCH-246
>      Project: Nutch
>         Type: Bug

>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>      Fix For: 0.8-dev

>
> I didn't reopen NUTCH-136 since this may be related to the hadoop split.
> I tested this on two different deployments (one with 10 tasktrackers + 1 
> jobtracker and one with 9 tasktrackers + 1 jobtracker).
> Defining the map and reduce task numbers in a mapred-default.xml does not 
> solve the problem. (The file is in nutch/conf on all boxes.)
> We verified that it is not a problem of maximum URLs per host, and also not 
> a problem of the URL filter.
> It looks like the first job of the Generator (Selector) already gets too few 
> entries to process. 
> Maybe this is somehow related to split generation or configuration inside 
> the distributed jobtracker, since it runs in a different JVM than the jobclient.
> However, we were not able to find the source of this problem.
> I think this should be fixed before publishing Nutch 0.8. 



