Hmm... I do use the max urls per host limit and have set it to 600, because in the worst case there are 6 s (measured from the logs) between URLs of the same host: 6 x 600 = 3600 s = 1 hour. So in the worst case the long tail shouldn't last longer than 1 hour... Unfortunately that is not what I see.
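To spell out that estimate, here is a minimal sketch in Python. The function name is only for illustration, and it assumes the URLs of a single host are fetched strictly one after another with a fixed gap between requests, which is a simplification of what the fetcher actually does:

```python
def worst_case_tail_seconds(max_urls_per_host: int, delay_seconds: float) -> float:
    """Upper bound on the time needed to drain one host's queue.

    Assumes (hypothetically) that the host's URLs are fetched sequentially
    and that each fetch is followed by a fixed delay.
    """
    return max_urls_per_host * delay_seconds


# Values from this mail: 600 URLs per host, 6 s between URLs of the same host.
tail = worst_case_tail_seconds(600, 6.0)
print(f"{tail:.0f} s = {tail / 3600:.1f} h")
```

If the observed tail is longer than this bound, then one of the two assumptions (the per-host cap or the 6 s gap) does not hold for the slowest hosts.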
I also tried the "by.ip" option, because some blog sites allocate a different domain name to each user... I saw no improvement.

I look at the time limit feature as a workaround for this number-of-hosts issue, and was thinking that there could be a more structural way to solve it.

2009/12/3, Andrzej Bialecki <[email protected]>:
> MilleBii wrote:
>> Oops continuing previous mail.
>>
>> So I wonder if there would be a better algorithm 'generate' which
>> would maintain a constant rate of host per 100 url ... Below a certain
>> threshold it stops or better starts including URLs of lower scores.
>
> That's exactly how the max.urls.per.host limit works.
>
>>
>> Using scores is de-optimizing the fetching process... Having said that
>> I should first read the code and try to understand it.
>
> That wouldn't hurt in any case ;)
>
> There is also a method in ScoringFilter-s (e.g. the default
> scoring-opic), where it determines the priority of URL during
> generation. See ScoringFilter.generatorSortValue(..), you can modify
> this method in scoring-opic (or in your own scoring filter) to
> prioritize certain urls over others.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>

--
-MilleBii-
