Matei Zaharia wrote:
An update: I noticed that I hadn't specified -numFetchers as a command-line argument, so I tried setting it to 10, but even then I end up with 100-200 pages for most fetchers and 9000 for one of them.
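
For reference, the generate invocation with that flag looks something like this (the crawldb and segments paths below are placeholders for my actual directories):

bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -numFetchers 10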

Matei Zaharia wrote:
I have my fetch threads, threads per host, and server delay set as follows:

<property>
  <name>fetcher.threads.fetch</name>
  <value>25</value>
  <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>150</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>


However, no matter what I do, I only get 2 mappers.
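
My guess, though I haven't verified it against the source, is that the generate step defaults the number of fetch lists to Hadoop's mapred.map.tasks when -numFetchers isn't given, and Hadoop ships with that property set to 2. If so, overriding it in hadoop-site.xml would look like:

<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
  <description>The default number of map tasks per job. Hadoop's stock
    value is 2, which would explain seeing only two mappers.</description>
</property>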

Matei

On Nov 10, 2007, at 12:18 PM, Sebastian Steinmetz wrote:

Well,

without knowing your configuration it's a bit hard to tell, but I think you may have set "fetcher.threads.per.host" too low (2, maybe?)

hope it helps,
    Sebastian Steinmetz


On Nov 10, 2007, at 8:57 PM, Matei Zaharia wrote:

Hi,

I am using Nutch to index about 1 million static HTML pages hosted on a single server on my LAN, running on a cluster of ~20 machines. However, whenever I perform a fetch, Nutch uses only two map workers, even though there are 20 available in the cluster, and it ends up giving 90% of the pages to one of them. For example, I created a fetchlist of 10,000 pages and ended up with one mapper fetching 175 of them and the other fetching 9000. What can I do to use more mappers and partition the load more evenly? My web server should be able to handle more simultaneous connections.

Thanks,

Matei Zaharia




Hey,

What kind of patterns do the URLs have?

Here is my wild guess: you have a limited set of hosts (surely fewer than 20) across the complete set of URLs. The class to look at is the partitioner Nutch uses when generating fetch lists (PartitionUrlByHost), which assigns URLs to fetch tasks by host, so every URL from the same host lands in the same partition. If this is the case, you will have to write a custom Partitioner.
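
Here is a minimal sketch of such a custom partitioner, assuming the old org.apache.hadoop.mapred API that Nutch uses (exact signatures vary a bit between Hadoop versions, and the class name is made up). Hashing the full URL instead of its host spreads the URLs across all fetch tasks, at the cost of the per-host politeness grouping; that trade-off should be fine in your case, since all the pages sit on a single server you control.

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative class name; this is not part of Nutch.
public class PartitionUrlByHash implements Partitioner {

  public void configure(JobConf job) {
    // A plain hash needs no configuration.
  }

  public int getPartition(WritableComparable key, Writable value,
                          int numPartitions) {
    // Hash the whole URL (the key) rather than just its host, so
    // URLs from a single host are spread over all fetch lists
    // instead of collapsing into one partition.
    return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

Wiring it in probably means patching the Generator job setup, since as far as I can tell the host-based partitioner is set in code rather than read from configuration.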



