Well, I dramatically increased the number of threads; empirically the
best I have found is around 1200. This actually means 2400, because I
have two mappers running at once (looking at the Hadoop logs).
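(For reference, the knob in question is fetcher.threads.fetch in
nutch-site.xml; the value below is just my current setting, and it
applies per map task, hence the doubling with two mappers.)

<property>
  <name>fetcher.threads.fetch</name>
  <value>1200</value>
</property>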
The bandwidth still gets an 'L' shape... although a lot higher and a
bit thicker. On the run before last the shape was a bit different, with
a linear decrease from full speed to the long tail... I interpreted it
as the number of hosts gradually decreasing. It's true that I forgot
about these parallel queues/threads. Now it could happen that, because
of the URL mix, the queues all start to query the same host (the blog
domain, as I explained earlier)... The site will protect against DoS
and only answer every 5 s or so, at which point the parallel queues
look like a serial one, hence the long tail??? Although timeouts should
clear queues of blocking URLs... Where do you look at timeouts? I never
saw one in the middle of a fetch???

2009/12/3, Julien Nioche <[email protected]>:
>> Hum... I use the max URLs per host setting and set it to 600, because
>> in the worst case you have 6 s (measured on the logs) between URLs of
>> the same host: so 6 x 600 = 3600 s = 1 hour. In the worst case the
>> long tail shouldn't last longer than 1 hour... Unfortunately that is
>> not what I see.
>
> That's assuming that all input URLs are read at once, put into their
> corresponding queue and ready to be fetched. In reality there is a cap
> on the number of URLs stored in the queues (see fetchQueues.totalSize
> in the logs) which is equal to 50 * number of threads.
>
> The value of 50 is fixed, but we could add a parameter to modify it. A
> workaround is simply to use more threads to increase the number of
> URLs stored in the queues.
>
> If you look at the logs you'll see that there are often situations
> where fetchQueues.totalSize is at the maximum value allowed but not
> all fetcher threads are active, which means that one or more queues
> prevent new URLs from being put in the queues by being large and
> filling up fetchQueues.totalSize.
>
> We can't read ahead the URL entries given to the mapper without having
> to store them somewhere, so the easiest option is probably to allow a
> custom multiplication factor for the fetchQueues.totalSize cap and
> make it so that it can be more than 50. That would increase the memory
> consumption a bit but definitely make the fetching rate more constant.
> You can also simply use more threads, but there would be a risk of
> getting timeouts if you specify too large a value.
>
> Makes sense?
>
>>
>> I also tried the "by.ip" option, because some blog sites allocate a
>> different domain name for each user... I saw no improvement.
>
> IP resolution is quite slow because it is not multithreaded, so that
> would not help anyway.
>
> Julien
>
>>
>> I see the time limit feature as a workaround for this number-of-hosts
>> issue and was thinking that there could be a more structural way to
>> solve it.
>>
>> 2009/12/3, Andrzej Bialecki <[email protected]>:
>> > MilleBii wrote:
>> >> Oops, continuing the previous mail.
>> >>
>> >> So I wonder if there would be a better 'generate' algorithm which
>> >> would maintain a constant ratio of hosts per 100 URLs... Below a
>> >> certain threshold it stops, or better, starts including URLs with
>> >> lower scores.
>> >
>> > That's exactly how the max.urls.per.host limit works.
>> >
>> >>
>> >> Using scores is de-optimizing the fetching process... Having said
>> >> that, I should first read the code and try to understand it.
>> >
>> > That wouldn't hurt in any case ;)
>> >
>> > There is also a method in ScoringFilter-s (e.g. the default
>> > scoring-opic) where it determines the priority of a URL during
>> > generation.
>> > See ScoringFilter.generatorSortValue(..); you can modify this
>> > method in scoring-opic (or in your own scoring filter) to
>> > prioritize certain URLs over others.
>> >
>> > --
>> > Best regards,
>> > Andrzej Bialecki <><
>> >  ___. ___ ___ ___ _ _   __________________________________
>> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> > ___|||__||  \|  || |   Embedded Unix, System Integration
>> > http://www.sigram.com  Contact: info at sigram dot com
>> >
>> >
>>
>> --
>> -MilleBii-
>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>

--
-MilleBii-
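PS: to make Julien's point about the cap concrete, here is a rough,
self-contained sketch of the feeding behaviour (illustrative names, not
the real org.apache.nutch.fetcher.Fetcher code): URLs are buffered into
per-host queues only while the total stays under threads * multiplier,
so one oversized queue can fill the cap and starve the others.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch only: shows where the hard-coded 50 sits and where a
// configurable multiplier would go.
public class FeederSketch {
  static final int THREADS = 1200;   // fetcher.threads.fetch
  static final int MULTIPLIER = 50;  // fixed today; the proposed parameter
  static final int MAX_TOTAL = THREADS * MULTIPLIER;

  final Map<String, Deque<String>> queues =
      new HashMap<String, Deque<String>>();
  int totalSize = 0;  // what fetchQueues.totalSize reports in the logs

  // Returns false once the cap is hit: the URL has to wait until a
  // fetcher thread drains one of the queues.
  boolean feed(String host, String url) {
    if (totalSize >= MAX_TOTAL) return false;
    Deque<String> q = queues.get(host);
    if (q == null) {
      q = new ArrayDeque<String>();
      queues.put(host, q);
    }
    q.add(url);
    totalSize++;
    return true;
  }

  public static void main(String[] args) {
    System.out.println("cap = " + MAX_TOTAL + " URLs across all queues");
    FeederSketch f = new FeederSketch();
    f.feed("blog.example.com", "http://blog.example.com/post/1");
    System.out.println("buffered: " + f.totalSize);
  }
}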
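PPS: and for Andrzej's pointer, a minimal sketch of the kind of custom
scoring filter he suggests, extending the default scoring-opic plugin;
"blogs.example.com" is a made-up placeholder for the flooding domain.

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.opic.OPICScoringFilter;

// Biases generatorSortValue so URLs from the over-represented blog
// domain sort lower, spreading each generated segment over more hosts.
public class HostAwareScoringFilter extends OPICScoringFilter {

  @Override
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    float sort = super.generatorSortValue(url, datum, initSort);
    if (url.toString().contains("blogs.example.com")) {
      sort *= 0.1f;  // de-prioritize the domain that floods the queues
    }
    return sort;
  }
}

To use something like this you would package it as a plugin and list it
in plugin.includes in place of scoring-opic.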
