> Hum... I use the max urls and set it to 600... Because in the worst
> case you have 6s (measured on logs) in between URLs of the same host:
> so 6 x 600 = 3600 s = 1 hour. In the worst case the long tail
> shouldn't last longer than 1 hour... Unfortunately it is not what I
> see.
That's assuming that all input URLs are read at once, put into their
corresponding queue and ready to be fetched. In reality there is a cap on
the number of URLs stored in the queues (see fetchQueues.totalSize in the
logs) which is equal to 50 * number of threads. The value of 50 is fixed,
but we could add a parameter to modify it. A workaround is simply to use
more threads to increase the number of URLs stored in the queues.

If you look at the logs you'll see that there are often situations where
fetchQueues.totalSize is at the maximum value allowed but not all fetcher
threads are active, which means that one or more queues are preventing new
URLs from being put in the queues by being large and filling up
fetchQueues.totalSize.

We can't read ahead the URL entries given to the mapper without having to
store them somewhere, so the easiest option is probably to allow a custom
multiplication factor for the fetchQueues.totalSize cap so that it could be
more than 50 per thread (see the sketch at the end of this message). That
would increase the memory consumption a bit but definitely make the
fetching rate a bit more constant. You can also simply use more threads,
but there would be a risk of getting timeouts if you specify too large a
value.

Makes sense?

> I also tried the "by.ip" option, because some blog sites allocate a
> different domain name for each user... I saw no improvements.

IP resolution is quite slow because it is not multithreaded, so that would
not help anyway.

Julien

> I look at the time limit feature as a workaround for this
> number-of-hosts issue and was thinking that there could be a more
> structural way to solve it.
>
> 2009/12/3, Andrzej Bialecki <[email protected]>:
> > MilleBii wrote:
> >> Oops, continuing previous mail.
> >>
> >> So I wonder if there would be a better 'generate' algorithm which
> >> would maintain a constant rate of hosts per 100 URLs... Below a
> >> certain threshold it stops, or better, starts including URLs of
> >> lower scores.
> >
> > That's exactly how the max.urls.per.host limit works.
> >
> >> Using scores is de-optimizing the fetching process... Having said
> >> that, I should first read the code and try to understand it.
> >
> > That wouldn't hurt in any case ;)
> >
> > There is also a method in ScoringFilter-s (e.g. the default
> > scoring-opic) where it determines the priority of a URL during
> > generation. See ScoringFilter.generatorSortValue(..); you can modify
> > this method in scoring-opic (or in your own scoring filter) to
> > prioritize certain URLs over others.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  || |   Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
>
> --
> -MilleBii-

--
DigitalPebble Ltd
http://www.digitalpebble.com
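
To make the multiplication-factor idea above a bit more concrete, here is a
rough, untested sketch of how the cap on queued URLs could be made
configurable. This is not current Nutch code: the property name
fetcher.queue.depth.multiplier is only an assumption for illustration, and
the Fetcher/QueueFeeder wiring is heavily simplified.

// Sketch: a configurable factor for the fetchQueues.totalSize cap instead
// of the hard-coded 50 per thread. Property name is hypothetical.
import org.apache.hadoop.conf.Configuration;

public class QueueCapSketch {

  /** Maximum number of URLs held in memory across all per-host fetch queues. */
  public static int maxQueueSize(Configuration conf, int threadCount) {
    // today the factor is effectively fixed at 50; making it a property
    // would let users trade a bit of memory for a steadier fetch rate
    int factor = conf.getInt("fetcher.queue.depth.multiplier", 50);
    return threadCount * factor;
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("fetcher.queue.depth.multiplier", 200);
    // 10 fetcher threads -> up to 2000 URLs queued instead of 500
    System.out.println(maxQueueSize(conf, 10));
  }
}

With a larger factor the feeder can keep topping up the queues of the other
hosts even when one slow host's queue takes a big share of the cap, which is
what seems to be starving the fetcher threads in the situation described
above.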

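As for the generatorSortValue() suggestion quoted above, a custom scoring
filter could do something along these lines. This is only a sketch: just the
one method is shown (signature as in Nutch 1.x's ScoringFilter),
isPriorityHost() is a made-up helper, and a real plugin would still have to
implement the rest of the interface and be registered like any other scoring
plugin.

// Untested sketch: boost the sort value of URLs from hosts we want fetched
// early so the Generator picks them first. The rest of the ScoringFilter
// interface and the plugin descriptor are omitted.
import java.net.URL;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;

public class HostBoostSketch {

  // hypothetical helper: decide whether a host should be prioritized
  private boolean isPriorityHost(String host) {
    return host.endsWith("example.com");
  }

  /** Higher sort values are selected first during generation. */
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    float sort = datum.getScore() * initSort; // roughly what scoring-opic does
    try {
      String host = new URL(url.toString()).getHost();
      if (isPriorityHost(host)) {
        sort *= 10.0f; // push these URLs towards the top of the generate list
      }
    } catch (Exception e) {
      // malformed URL: just fall back to the unboosted value
    }
    return sort;
  }
}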