2008/9/19 Edward Quick <[EMAIL PROTECTED]>:
>
> Also forgot to mention, what should mapred.map.tasks and mapred.reduce.tasks
> be set to?
>

I haven't run the fetcher in distributed mode for a while, but back then it
would run as many map tasks as there are parts under crawl_generate. That may
have changed since. Anyway, try setting mapred.map.tasks to 3 as well for
fetching. I think that may work.

> Thanks,
>
> Ed.
>
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: RE: running fetches in hadoop
> Date: Thu, 18 Sep 2008 19:36:45 +0000
>
>>
>> 2008/9/18 Edward Quick <[EMAIL PROTECTED]>:
>> >
>> > Thanks Doğacan,
>> >
>> > I set numFetchers but only see the fetch being done from one host at one
>> > time, not all at the same time.
>> > This is what I ran:
>> >
>> > -bash-3.00$ bin/nutch generate crawl/crawldb crawl/segments -numFetchers 3
>> > Generator: Selecting best-scoring urls due for fetch.
>> > Generator: starting
>> > Generator: segment: crawl/segments/20080918173443
>> > Generator: filtering: true
>> > Generator: Partitioning selected urls by host, for politeness.
>> > Generator: done.
>> > -bash-3.00$ bin/nutch fetch crawl/segments/20080918173443
>> > Fetcher: starting
>> > Fetcher: segment: crawl/segments/20080918173443
>> >
>>
>> Hmm, how many parts are under crawl/segments/20080918173443/crawl_generate?
>
> -bash-3.00$ bin/hadoop dfs -ls crawl/segments/20080918173443/crawl_generate
> Found 3 items
> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00000 <r 1> 86 2008-09-18 17:35 rw-r--r-- nutch supergroup
> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00001 <r 1> 86 2008-09-18 17:35 rw-r--r-- nutch supergroup
> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00002 <r 1> 442915 2008-09-18 17:35 rw-r--r-- nutch supergroup
> -bash-3.00$
>
> This is what I have set in nutch-site.xml remembering I have 3 hosts:
> fetcher.server.delay 0.01
> fetcher.threads.fetch 10
> fetcher.threads.per.host 30
>
>> >
>> >> Date: Thu, 18 Sep 2008 18:34:26 +0300
>> >> From: [EMAIL PROTECTED]
>> >> To: [email protected]
>> >> Subject: Re: running fetches in hadoop
>> >>
>> >> Hi,
>> >>
>> >> On Thu, Sep 18, 2008 at 5:23 PM, Edward Quick <[EMAIL PROTECTED]> wrote:
>> >> >
>> >> > I have 3 hosts in a hadoop cluster and noticed that the fetch only runs
>> >> > from one host at a time.
>> >> > Is that right or should the fetch run from all 3 hosts at the same time?
>> >> >
>> >>
>> >> Try running generate like this:
>> >>
>> >> bin/nutch generate <other options> -numFetchers 3
>> >>
>> >> > Thanks,
>> >> >
>> >> > Ed.
>> >>
>> >> --
>> >> Doğacan Güney
>>
>> --
>> Doğacan Güney
>

--
Doğacan Güney
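
For reference, the part count that the reply above ties to the number of fetch
map tasks can be read straight off the quoted listing. The following is only a
rough Python sketch: the helper name part_sizes and the regular expression are
made up for illustration (they are not part of Nutch or Hadoop), and the
pattern assumes the `bin/hadoop dfs -ls` output format exactly as quoted in
this thread.

```python
import re

# Listing copied from the `bin/hadoop dfs -ls` output quoted in this thread.
LISTING = """\
Found 3 items
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00000 <r 1> 86 2008-09-18 17:35 rw-r--r-- nutch supergroup
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00001 <r 1> 86 2008-09-18 17:35 rw-r--r-- nutch supergroup
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00002 <r 1> 442915 2008-09-18 17:35 rw-r--r-- nutch supergroup
"""

# Hypothetical pattern: "<path>/part-NNNNN <r N> <size in bytes> ..."
PART_LINE = re.compile(r"/(part-\d+)\s+<r\s+\d+>\s+(\d+)")


def part_sizes(listing):
    """Return {part name: size in bytes} for every part file in the listing."""
    return {m.group(1): int(m.group(2)) for m in PART_LINE.finditer(listing)}


sizes = part_sizes(LISTING)
print("parts:", len(sizes))
for name, size in sorted(sizes.items()):
    print(name, size)
```

Going by the sizes alone, almost all of the data sits in part-00002, so even
with three map tasks one of them may end up doing most of the fetching. That is
only a guess from the listing, not something verified on the cluster.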
