Hmm... I use the max URLs per host setting and set it to 600, because in the
worst case there are 6 s (measured from the logs) between URLs of the same
host: so 6 x 600 = 3600 s = 1 hour. In the worst case the long tail shouldn't
last longer than 1 hour... Unfortunately that is not what I see.
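
To spell out that estimate, here is a minimal sketch of the arithmetic. It assumes the URLs of one host are fetched strictly one after another with a fixed gap between them; the function name is only illustrative and is not part of Nutch.

```python
def worst_case_tail_seconds(max_urls_per_host: int, delay_between_fetches_s: float) -> float:
    """Upper bound on the time to drain one host's queue, assuming its URLs
    are fetched strictly sequentially with a fixed gap (an assumption)."""
    return max_urls_per_host * delay_between_fetches_s


# Values from the mail: 600 URLs per host, 6 s measured between fetches.
tail_s = worst_case_tail_seconds(600, 6)
print(f"{tail_s:.0f} s = {tail_s / 3600:.1f} h")
```

If the observed tail is longer than this bound, either the real gap between fetches is larger than 6 s or more than 600 URLs end up queued for the same fetch queue.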

I also tried the "by.ip" option, because some blog sites allocate a
different domain name to each user... I saw no improvement.

I see the time limit feature as a workaround for this number-of-hosts issue
and was thinking that there could be a more structural way to solve it.

2009/12/3, Andrzej Bialecki <[email protected]>:
> MilleBii wrote:
>> Oops continuing previous mail.
>>
>> So I wonder if there would be a better 'generate' algorithm which
>> would maintain a constant rate of hosts per 100 URLs... Below a certain
>> threshold it stops or, better, starts including URLs with lower scores.
>
> That's exactly how the max.urls.per.host limit works.
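
As a rough illustration of that behaviour, here is a toy sketch of selecting URLs by score under a per-host cap. It is not the Nutch Generator code; the function name, the parameters `top_n` and `max_per_host`, and the example URLs are all made up for the illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def select_with_host_cap(scored_urls, top_n, max_per_host):
    """Pick up to top_n URLs by descending score, at most max_per_host per host."""
    per_host = Counter()
    selected = []
    for url, score in sorted(scored_urls, key=lambda pair: pair[1], reverse=True):
        if len(selected) >= top_n:
            break
        host = urlsplit(url).hostname
        if per_host[host] >= max_per_host:
            continue  # host is full, so lower-scored URLs of other hosts get in
        per_host[host] += 1
        selected.append(url)
    return selected


# Hypothetical candidates: one high-scoring host and two low-scoring ones.
candidates = [
    ("http://a.example/1", 0.9),
    ("http://a.example/2", 0.8),
    ("http://a.example/3", 0.7),
    ("http://b.example/1", 0.2),
    ("http://c.example/1", 0.1),
]
print(select_with_host_cap(candidates, top_n=4, max_per_host=2))
```

With a cap of 2, the third URL of a.example is skipped and the lower-scored URLs of b.example and c.example fill the remaining slots, which is the "starts including URLs of lower scores" effect.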
>
>>
>> Using scores is de-optimizing the fetching process... Having said that,
>> I should first read the code and try to understand it.
>
> That wouldn't hurt in any case ;)
>
> There is also a method in ScoringFilter-s (e.g. the default
> scoring-opic) that determines the priority of a URL during
> generation. See ScoringFilter.generatorSortValue(..); you can modify
> this method in scoring-opic (or in your own scoring filter) to
> prioritize certain URLs over others.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
-MilleBii-
