> Hum... I use the max urls and set it to 600... Because in the worst
> case you have 6s (measured on logs) in between URLs of the same host:
> so 6 x 600 = 3600 s = 1 hour. In the worst case the long tail
> shouldn't last longer than 1 hour... Unfortunately it is not what I
> see.
That's assuming that all input URLs are read at once, put into their
corresponding queue and ready to be fetched. In reality there is a cap on
the number of URLs stored in the queues (see fetchQueues.totalSize in the
logs) which is equal to 50 * number of threads. The value of 50 is fixed,
but we could add a parameter to modify it. A workaround is simply to use
more threads to increase the number of URLs stored in the queues.

If you look at the logs you'll see that there are often situations where
fetchQueues.totalSize is at the maximum value allowed but not all fetcher
threads are active, which means that one or more queues are preventing new
URLs from being put in the queues by being large and filling up
fetchQueues.totalSize.

We can't read ahead the URL entries given to the mapper without having to
store them somewhere, so the easiest option is probably to allow a custom
multiplication factor for the fetchQueues.totalSize cap so that it could be
more than 50 per thread (see the sketch at the end of this message). That
would increase the memory consumption a bit but definitely make the
fetching rate a bit more constant. You can also simply use more threads,
but there would be a risk of getting timeouts if you specify too large a
value.

Makes sense?

> I also tried the "by.ip" option, because some blog sites allocate a
> different domain name for each user... I saw no improvements.

IP resolution is quite slow because it is not multithreaded, so that would
not help anyway.

Julien

> I look at the time limit feature as a workaround for this
> number-of-hosts issue and was thinking that there could be a more
> structural way to solve it.
>
> 2009/12/3, Andrzej Bialecki <[email protected]>:
> > MilleBii wrote:
> >> Oops, continuing previous mail.
> >>
> >> So I wonder if there would be a better 'generate' algorithm which
> >> would maintain a constant rate of hosts per 100 URLs... Below a
> >> certain threshold it stops, or better, starts including URLs of
> >> lower scores.
> >
> > That's exactly how the max.urls.per.host limit works.
> >
> >> Using scores is de-optimizing the fetching process... Having said
> >> that, I should first read the code and try to understand it.
> >
> > That wouldn't hurt in any case ;)
> >
> > There is also a method in ScoringFilter-s (e.g. the default
> > scoring-opic) where it determines the priority of a URL during
> > generation. See ScoringFilter.generatorSortValue(..); you can modify
> > this method in scoring-opic (or in your own scoring filter) to
> > prioritize certain URLs over others.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  || |   Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
>
> --
> -MilleBii-

--
DigitalPebble Ltd
http://www.digitalpebble.com
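
To make the multiplication-factor idea above a bit more concrete, here is a
rough, untested sketch of how the cap on queued URLs could be made
configurable. This is not current Nutch code: the property name
fetcher.queue.depth.multiplier is only an assumption for illustration, and
the Fetcher/QueueFeeder wiring is heavily simplified.

// Sketch: a configurable factor for the fetchQueues.totalSize cap instead
// of the hard-coded 50 per thread. Property name is hypothetical.
import org.apache.hadoop.conf.Configuration;

public class QueueCapSketch {

  /** Maximum number of URLs held in memory across all per-host fetch queues. */
  public static int maxQueueSize(Configuration conf, int threadCount) {
    // today the factor is effectively fixed at 50; making it a property
    // would let users trade a bit of memory for a steadier fetch rate
    int factor = conf.getInt("fetcher.queue.depth.multiplier", 50);
    return threadCount * factor;
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("fetcher.queue.depth.multiplier", 200);
    // 10 fetcher threads -> up to 2000 URLs queued instead of 500
    System.out.println(maxQueueSize(conf, 10));
  }
}

With a larger factor the feeder can keep topping up the queues of the other
hosts even when one slow host's queue takes a big share of the cap, which is
what seems to be starving the fetcher threads in the situation described
above.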

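As for the generatorSortValue() suggestion quoted above, a custom scoring
filter could do something along these lines. This is only a sketch: just the
one method is shown (signature as in Nutch 1.x's ScoringFilter),
isPriorityHost() is a made-up helper, and a real plugin would still have to
implement the rest of the interface and be registered like any other scoring
plugin.

// Untested sketch: boost the sort value of URLs from hosts we want fetched
// early so the Generator picks them first. The rest of the ScoringFilter
// interface and the plugin descriptor are omitted.
import java.net.URL;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;

public class HostBoostSketch {

  // hypothetical helper: decide whether a host should be prioritized
  private boolean isPriorityHost(String host) {
    return host.endsWith("example.com");
  }

  /** Higher sort values are selected first during generation. */
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    float sort = datum.getScore() * initSort; // roughly what scoring-opic does
    try {
      String host = new URL(url.toString()).getHost();
      if (isPriorityHost(host)) {
        sort *= 10.0f; // push these URLs towards the top of the generate list
      }
    } catch (Exception e) {
      // malformed URL: just fall back to the unboosted value
    }
    return sort;
  }
}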