Doğacan Güney wrote:
> 2008/9/19 Edward Quick <[EMAIL PROTECTED]>:
>> Also, I forgot to mention: what should mapred.map.tasks and
>> mapred.reduce.tasks be set to?
>
> I haven't run the fetcher in distributed mode for a while, but back then
> the fetcher would run as many map tasks as there are parts under
> crawl_generate. So maybe this has changed. Anyway, try setting
> mapred.map.tasks to 3 as well for fetching.
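(As an aside, for anyone wondering where those two knobs live programmatically, here is a minimal sketch against the pre-0.20 JobConf API. The class name is made up for illustration and this is not the actual Nutch fetcher job setup; mapred.map.tasks is only a hint, since the real number of map tasks follows the input splits.)

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
  public static void main(String[] args) {
    // Hypothetical job configuration, just to show the two properties.
    JobConf conf = new JobConf(TaskCountSketch.class);
    conf.setNumMapTasks(3);     // mapred.map.tasks (a hint only)
    conf.setNumReduceTasks(3);  // mapred.reduce.tasks
  }
}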
It didn't change, and that's not the issue here. Look at the size of the
parts:
-bash-3.00$ bin/hadoop dfs -ls crawl/segments/20080918173443/crawl_generate
Found 3 items
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00000 <r 1>      86      2008-09-18 17:35   rw-r--r--   nutch   supergroup
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00001 <r 1>      86      2008-09-18 17:35   rw-r--r--   nutch   supergroup
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00002 <r 1>      442915  2008-09-18 17:35   rw-r--r--   nutch   supergroup
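A quick way to check what is actually inside each of those part files is to read them back with the plain SequenceFile API, along the lines of the sketch below (the class name is made up; pass one of the part paths from the listing above as the argument):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Counts the records in one SequenceFile part, e.g. .../crawl_generate/part-00000
public class CountPartRecords {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path(args[0]);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long count = 0;
    while (reader.next(key, value)) {
      count++;
    }
    reader.close();
    System.out.println(part + ": " + count + " records");
  }
}

For the listing above, the first two parts should come back with 0 records.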
The "problem" is that parts 0 and 1 contain no data (the 86 bytes is
consumed by a header of an empty SequenceFile). It's not really a
problem as such - this is most likely caused by a skewed distribution of
urls among hosts, i.e. all the urls on this fetchlist come from a single
or very few hosts (which accidentally are hashed to the same partition).
Then, when you start the fetcher, it may create 3 tasks, and 2 of them
finish their job immediately (no input data), and all the remaining urls
are handled jut by a single task.
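To make the skew concrete: generated urls are partitioned by host before fetching, so every url from the same host lands in the same part. Below is a minimal sketch of that kind of host-hash partitioning against the old mapred API; it illustrates the idea behind Nutch's PartitionUrlByHost, not its exact code.

import java.net.URL;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each url to a partition based on its host, so all urls from one
// host end up in the same crawl_generate part. With only one or two
// distinct hosts, most parts stay empty.
public class HostHashPartitionerSketch implements Partitioner<Text, Writable> {
  public void configure(JobConf job) {
  }

  public int getPartition(Text urlKey, Writable value, int numPartitions) {
    String host;
    try {
      host = new URL(urlKey.toString()).getHost();
    } catch (Exception e) {
      host = urlKey.toString(); // fall back to the whole url
    }
    return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

With a fetchlist dominated by one host, getPartition returns the same index for every url, which matches the listing above: two 86-byte empty parts and one part with everything in it.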
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com