Matei Zaharia wrote:
An update: I noticed that I hadn't specified -numFetchers as a
command-line argument, so I tried setting it to 10, but even then I
end up with 100-200 pages for most fetchers and 9,000 for one of them.
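For reference, the generate invocation looks roughly like this (the
crawldb and segments paths are placeholders for my actual layout, and
-topN matches the 10,000-page fetchlist mentioned further down):

bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -numFetchers 10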
Matei Zaharia wrote:
I have my threads per fetcher, threads per host, and fetcher delay set
as follows:
<property>
  <name>fetcher.threads.fetch</name>
  <value>25</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one
  connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>150</value>
  <description>This number is the maximum number of threads that
  should be allowed to access a host at one time.</description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>
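(A note on placement: overrides like these normally belong in
conf/nutch-site.xml, which takes precedence over conf/nutch-default.xml,
wrapped in the standard root element, roughly:

<?xml version="1.0"?>
<configuration>
  <!-- the three fetcher properties above go here -->
</configuration>)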
However, no matter what I do, I only get 2 mappers.
Matei
On Nov 10, 2007, at 12:18 PM, Sebastian Steinmetz wrote:
Well,
without knowing your configuration it's a bit hard to tell, but I
think you may have set "fetcher.threads.per.host" too low (2, maybe?).
Hope it helps,
Sebastian Steinmetz
On Nov 10, 2007, at 8:57 PM, Matei Zaharia wrote:
Hi,
I am using Nutch to index about 1 million static HTML pages, all
served by a single server on my LAN, using a cluster of ~20 machines.
However, whenever I perform a fetch, Nutch only uses two map workers,
despite there being 20 in the cluster, and ends up giving 90% of the
pages to one of them. For example, I created a fetchlist of 10,000
pages and ended up with one mapper fetching 175 of them and another
fetching 9,000. What can I do to use more mappers and partition the
load more evenly? My web server should be able to handle more
connections at once.
Thanks,
Matei Zaharia
Hey,
What kind of patterns do the URLs have?
This is my wild guess: you have a limited set of domains (surely fewer
than 20) for the complete set of URLs.
HashPartitioner, which partitions the URLs based on domain, is the
class to look at.
And if this is true, you will have to write a custom Partitioner.
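A minimal sketch of what such a custom Partitioner could look like,
against the old org.apache.hadoop.mapred API that Nutch used at the
time (the class name here is made up, and you'd still have to wire it
into the generate job in place of the stock by-host partitioner):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical sketch: partition fetchlist entries by hashing the full
// URL instead of just the host, so one host's pages spread evenly
// across all fetch map tasks.
public class PartitionUrlByHash implements Partitioner<Text, Writable> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text url, Writable value, int numReduceTasks) {
    // mask the sign bit so the modulo result is always non-negative
    return (url.toString().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Note that spreading one host's URLs across all fetchers deliberately
gives up Nutch's per-host grouping; that only makes sense in a setup
like yours, where every page lives on a single server that can take
the extra concurrent load.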