David Bargeron wrote:
Thanks. How can I determine how many unique hosts there are in my fetchlists? And if it turns out there are not many unique hosts, can I force Nutch to favor many unique hosts?
You can dump the generated fetchlist (see bin/nutch readseg; you need to exclude the segment parts that don't exist yet) and then use regular Unix tools to extract the host names from the dump and count the unique ones.
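If you'd rather script the counting step, here is a minimal Python sketch. It assumes the readseg dump is a plain-text file in which every fetchlist URL appears once, somewhere on a line, and it simply picks out anything that looks like an http(s) URL. The exact dump layout depends on your Nutch version, so treat the regular expression and the function name count_hosts as illustrative, not as anything Nutch provides.

```python
import re
import sys
from collections import Counter
from urllib.parse import urlsplit

# Assumption: URLs in the dump are http(s) and end at the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def count_hosts(lines):
    """Return a Counter mapping host name -> number of URLs seen for it."""
    hosts = Counter()
    for line in lines:
        for url in URL_RE.findall(line):
            try:
                host = urlsplit(url).hostname  # lower-cased, port stripped
            except ValueError:
                continue  # skip malformed URLs instead of aborting
            if host:
                hosts[host] += 1
    return hosts


# Hypothetical usage: python count_hosts.py /path/to/dump-file
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        hosts = count_hosts(f)
    print(f"{len(hosts)} unique hosts, {sum(hosts.values())} URLs")
    # Show the hosts that dominate the fetchlist.
    for host, n in hosts.most_common(20):
        print(f"{n:8d}  {host}")
```

If the same URL shows up more than once per record in your dump, the per-host totals will be inflated, but the number of unique hosts is still correct.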
You can also limit the number of URLs per host: see the property generate.max.per.host in nutch-default.xml. Please note that this may drastically reduce the number of URLs generated for a segment, leaving it significantly lower than the target topN.
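To get a feel for how much a per-host limit can shrink a segment, here is a rough back-of-the-envelope sketch in Python. It is a simplified model, not the generator's actual logic: it ignores scoring and just caps each host's contribution, then caps the total at topN. The function name capped_total and the sample numbers are made up for illustration.

```python
def capped_total(host_counts, max_per_host, top_n):
    """Estimate segment size when each host contributes at most max_per_host URLs.

    Simplified model (assumption): no scoring, every candidate URL is eligible.
    """
    kept = sum(min(n, max_per_host) for n in host_counts.values())
    return min(kept, top_n)


# Made-up example: one host dominates the candidate URLs.
candidates = {"big.example": 900, "mid.example": 80, "small.example": 20}

print(capped_total(candidates, max_per_host=1000, top_n=1000))  # limit has no effect
print(capped_total(candidates, max_per_host=50, top_n=1000))    # limit applies
```

With these numbers the first call yields 1000 URLs and the second only 120 (50 + 50 + 20), which is the kind of shortfall against topN I mean.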
-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
