Well, I dramatically increased the number of threads; empirically the
best I have found is around 1200. This actually means 2400, because I
have two mappers running at once (looking at the Hadoop logs).
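(For reference, the knob in question is fetcher.threads.fetch in
nutch-site.xml; the value below is just my current setting, and it
applies per map task, hence the doubling with two mappers.)

<property>
  <name>fetcher.threads.fetch</name>
  <value>1200</value>
</property>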
The bandwidth still gets an 'L' shape... although a lot higher and a
bit thicker. On the run before last the shape was a bit different, with
a linear decrease from full speed to the long tail... I interpreted it
as the number of hosts gradually decreasing. It's true that I forgot
about these parallel queues/threads. Now it could happen that, because
of the URL mix, the queues all start to query the same host (the blog
domain, as I explained earlier)... The site will protect against DoS
and only answer every 5 s or so, at which point the parallel queues
look like a serial one, hence the long tail??? Although timeouts should
clear queues of blocking URLs... Where do you look at timeouts? I never
saw one in the middle of a fetch???

2009/12/3, Julien Nioche <[email protected]>:
>> Hum... I use the max URLs per host setting and set it to 600, because
>> in the worst case you have 6 s (measured on the logs) between URLs of
>> the same host: so 6 x 600 = 3600 s = 1 hour. In the worst case the
>> long tail shouldn't last longer than 1 hour... Unfortunately that is
>> not what I see.
>
> That's assuming that all input URLs are read at once, put into their
> corresponding queue and ready to be fetched. In reality there is a cap
> on the number of URLs stored in the queues (see fetchQueues.totalSize
> in the logs) which is equal to 50 * number of threads.
>
> The value of 50 is fixed, but we could add a parameter to modify it. A
> workaround is simply to use more threads to increase the number of
> URLs stored in the queues.
>
> If you look at the logs you'll see that there are often situations
> where fetchQueues.totalSize is at the maximum value allowed but not
> all fetcher threads are active, which means that one or more queues
> prevent new URLs from being put in the queues by being large and
> filling up fetchQueues.totalSize.
>
> We can't read ahead the URL entries given to the mapper without having
> to store them somewhere, so the easiest option is probably to allow a
> custom multiplication factor for the fetchQueues.totalSize cap and
> make it so that it can be more than 50. That would increase the memory
> consumption a bit but definitely make the fetching rate more constant.
> You can also simply use more threads, but there would be a risk of
> getting timeouts if you specify too large a value.
>
> Makes sense?
>
>>
>> I also tried the "by.ip" option, because some blog sites allocate a
>> different domain name for each user... I saw no improvement.
>
> IP resolution is quite slow because it is not multithreaded, so that
> would not help anyway.
>
> Julien
>
>>
>> I see the time limit feature as a workaround for this number-of-hosts
>> issue and was thinking that there could be a more structural way to
>> solve it.
>>
>> 2009/12/3, Andrzej Bialecki <[email protected]>:
>> > MilleBii wrote:
>> >> Oops, continuing the previous mail.
>> >>
>> >> So I wonder if there would be a better 'generate' algorithm which
>> >> would maintain a constant ratio of hosts per 100 URLs... Below a
>> >> certain threshold it stops, or better, starts including URLs with
>> >> lower scores.
>> >
>> > That's exactly how the max.urls.per.host limit works.
>> >
>> >>
>> >> Using scores is de-optimizing the fetching process... Having said
>> >> that, I should first read the code and try to understand it.
>> >
>> > That wouldn't hurt in any case ;)
>> >
>> > There is also a method in ScoringFilter-s (e.g. the default
>> > scoring-opic) where it determines the priority of a URL during
>> > generation.
>> > See ScoringFilter.generatorSortValue(..); you can modify this
>> > method in scoring-opic (or in your own scoring filter) to
>> > prioritize certain URLs over others.
>> >
>> > --
>> > Best regards,
>> > Andrzej Bialecki <><
>> >  ___. ___ ___ ___ _ _   __________________________________
>> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> > ___|||__||  \|  || |   Embedded Unix, System Integration
>> > http://www.sigram.com  Contact: info at sigram dot com
>> >
>> >
>>
>> --
>> -MilleBii-
>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>

--
-MilleBii-
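PS: to make Julien's point about the cap concrete, here is a rough,
self-contained sketch of the feeding behaviour (illustrative names, not
the real org.apache.nutch.fetcher.Fetcher code): URLs are buffered into
per-host queues only while the total stays under threads * multiplier,
so one oversized queue can fill the cap and starve the others.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch only: shows where the hard-coded 50 sits and where a
// configurable multiplier would go.
public class FeederSketch {
  static final int THREADS = 1200;   // fetcher.threads.fetch
  static final int MULTIPLIER = 50;  // fixed today; the proposed parameter
  static final int MAX_TOTAL = THREADS * MULTIPLIER;

  final Map<String, Deque<String>> queues =
      new HashMap<String, Deque<String>>();
  int totalSize = 0;  // what fetchQueues.totalSize reports in the logs

  // Returns false once the cap is hit: the URL has to wait until a
  // fetcher thread drains one of the queues.
  boolean feed(String host, String url) {
    if (totalSize >= MAX_TOTAL) return false;
    Deque<String> q = queues.get(host);
    if (q == null) {
      q = new ArrayDeque<String>();
      queues.put(host, q);
    }
    q.add(url);
    totalSize++;
    return true;
  }

  public static void main(String[] args) {
    System.out.println("cap = " + MAX_TOTAL + " URLs across all queues");
    FeederSketch f = new FeederSketch();
    f.feed("blog.example.com", "http://blog.example.com/post/1");
    System.out.println("buffered: " + f.totalSize);
  }
}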
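PPS: and for Andrzej's pointer, a minimal sketch of the kind of custom
scoring filter he suggests, extending the default scoring-opic plugin;
"blogs.example.com" is a made-up placeholder for the flooding domain.

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.opic.OPICScoringFilter;

// Biases generatorSortValue so URLs from the over-represented blog
// domain sort lower, spreading each generated segment over more hosts.
public class HostAwareScoringFilter extends OPICScoringFilter {

  @Override
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    float sort = super.generatorSortValue(url, datum, initSort);
    if (url.toString().contains("blogs.example.com")) {
      sort *= 0.1f;  // de-prioritize the domain that floods the queues
    }
    return sort;
  }
}

To use something like this you would package it as a plugin and list it
in plugin.includes in place of scoring-opic.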
