David Bargeron wrote:
Thanks. How can I determine how many unique hosts there are in my
fetchlists? And if it turns out there are not many unique hosts, can I force
Nutch to favor many unique hosts?

You can dump the generated fetchlist (see bin/nutch readseg - you need to exclude the segment parts that are missing, since the segment has not been fetched yet) and then use regular Unix tools to count the unique hosts in this list.
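
If you would rather script the counting than chain Unix tools, here is a minimal sketch in Python. It assumes only that the dump is plain text with the URLs appearing somewhere on its lines; I am not relying on the exact readseg output layout, so the regular expression simply picks up anything that looks like an http(s) URL. The function name count_hosts is made up for this example.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: URLs in the dump are plain http(s) URLs delimited by whitespace.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def count_hosts(lines):
    """Return a Counter mapping host name -> number of distinct URLs."""
    unique_urls = set()
    for line in lines:
        unique_urls.update(URL_RE.findall(line))

    hosts = Counter()
    for url in unique_urls:
        try:
            host = urlsplit(url).hostname  # lower-cased, port stripped
        except ValueError:
            continue  # skip strings that only look like URLs
        if host:
            hosts[host] += 1
    return hosts
```

To use it, open the dump file and pass the file object in: len(count_hosts(f)) is the number of unique hosts, and count_hosts(f).most_common(20) shows which hosts dominate the fetchlist. URLs are de-duplicated first, so a URL that the dump repeats is counted once.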

You can also limit the number of urls per host - see the property generate.max.per.host in nutch-default.xml. Please note that this may drastically decrease the final number of generated urls in a segment, so that it's significantly lower than the target topN number.
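
To see why the segment can end up well below topN, here is a toy model of a per-host cap in Python. This is only an illustration of the effect, not the code Nutch's generator actually runs: the function select_with_host_cap and its parameters are invented for the example, and treating a non-positive cap as "no limit" is an assumption of the sketch.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def select_with_host_cap(urls, top_n, max_per_host):
    """Take URLs in the given order, up to top_n, at most max_per_host per host.

    A non-positive max_per_host means no per-host limit (sketch assumption).
    """
    per_host = defaultdict(int)
    selected = []
    for url in urls:
        if len(selected) >= top_n:
            break
        host = urlsplit(url).hostname or ""
        if max_per_host > 0 and per_host[host] >= max_per_host:
            continue  # this host has already reached its cap
        per_host[host] += 1
        selected.append(url)
    return selected
```

For instance, with ten candidate URLs from one host and two from another, a target of 8 and a cap of 3 yields only 5 URLs: three from the first host plus both from the second. The shortfall grows the more the candidates are concentrated on a few hosts.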


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
