I am also evaluating performance, but on a single machine. I am finding that it crawls about two URLs per second. The fetch list is mostly unique hosts, so I am looking for other performance bottlenecks. The machine is an old PIII with 512MB of RAM running at a load average of 3-4, so I am going to try a faster machine next week.
What details about the network or the DNS setup should I find out to determine bottlenecks in that area?

Vince

On 8/22/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> David Bargeron wrote:
> > Thanks. How can I determine how many unique hosts there are in my
> > fetchlists? And if it turns out there are not many unique hosts, can I
> > force Nutch to favor many unique hosts?
>
> You can dump the generated fetchlist (see bin/nutch readseg - you need
> to exclude missing segment parts) and then use regular Unix tools to
> prepare this list.
>
> You can also limit the number of urls per host - see the property
> generate.max.per.host in nutch-default.xml. Please note that this may
> drastically decrease the final number of generated urls in a segment, so
> that it's significantly lower than the target topN number.
>
> --
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com  Contact: info at sigram dot com
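And to override the per-host limit Andrzej mentions, the property goes in conf/nutch-site.xml (the value 100 below is only an illustration; pick a limit that suits your crawl, and note the caveat above that a low value can shrink the segment well below topN):

```xml
<!-- Example override in conf/nutch-site.xml; 100 is an illustrative value. -->
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
  <description>Maximum number of urls per host in a single fetchlist.
  A negative value means no limit.</description>
</property>
```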
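For anyone following along, here is a rough sketch of the "dump the fetchlist and use Unix tools" step Andrzej describes. The readseg invocation is shown as a comment because the segment path depends on your crawl directory; the sample file below just stands in for the dumped URL list so the host-counting pipeline can be shown end to end:

```shell
# Dump the generated fetchlist to text first (segment path is an example,
# adjust to your own crawl dir; flags exclude the missing segment parts):
#   bin/nutch readseg -dump crawl/segments/20070822 dump_dir \
#       -nocontent -nofetch -noparse -noparsedata -noparsetext

# Hypothetical sample standing in for the URLs pulled out of the dump:
cat > fetchlist.txt <<'EOF'
http://example.com/a
http://example.com/b
http://example.org/index.html
http://example.net/page?x=1
EOF

# Extract the host from each URL, then count distinct hosts:
sed -E 's|^[a-z]+://([^/:]+).*|\1|' fetchlist.txt | sort -u | wc -l
```

If the count is small relative to the fetchlist size, the fetcher spends most of its time waiting on per-host politeness delays rather than on CPU, disk, or DNS.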
