On Mon, Mar 23, 2015 at 10:11:56AM -0600, Stephen John Smoogen wrote:
> On 23 March 2015 at 09:59, Adrian Reber <[email protected]> wrote:
> > > Additionally the 4GB of RAM on mm-crawler01 are not enough to
> > > crawl all the mirrors in a reasonable time. Even if only
> > > started with 20 crawler threads instead of 75 the 4GB are not
> > > enough.
> >
> > This has been increased to 32GB (thanks) and I had a few test runs of
> > the crawler over the weekend with libcurl from F21:
> >
> > All runs for 435 mirrors take at least 6 hours:
> >
> > 50 threads:
> >
> > http://lisas.de/~adrian/crawler-resources/2015-03-21-19-51-44-crawler-resources.pdf
> >
> > 50 threads with explicit garbage collection:
> >
> > http://lisas.de/~adrian/crawler-resources/2015-03-22-06-18-30-crawler-resources.pdf
> >
> > 75 threads:
> >
> > http://lisas.de/~adrian/crawler-resources/2015-03-22-13-02-37-crawler-resources.pdf
> >
> > 75 threads with explicitly setting variables to None at the end:
> >
> > http://lisas.de/~adrian/crawler-resources/2015-03-23-07-46-19-crawler-resources.pdf
> >
> > Manually triggering the garbage collector makes almost no difference
> > (if any at all). The crawler takes a huge amount of memory and a
> > really long time.
> >
> > As much as I like the new threaded design, I am not 100% convinced it
> > is the best solution when looking at the memory requirements.
> > Somewhere memory must be leaking.
> >
> > The next change I will make is to sort the mirrors in descending
> > order of crawl duration to make sure the longest-running crawls are
> > started as early as possible (this was implemented in MM1). I will
> > then try to start with 100 threads to see how long it takes and how
> > much memory is required.
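For reference, the "explicit garbage collection" and "setting variables to None" runs above could look roughly like this sketch; the function names are placeholders, not the actual MirrorManager crawler code:

```python
import gc

def crawl_mirror(host):
    # Placeholder for the real per-mirror crawl (fetch and compare
    # file lists via libcurl in the actual crawler).
    return {"host": host, "up_to_date": True}

def crawl_all(hosts):
    for host in hosts:
        result = crawl_mirror(host)
        # Explicitly drop the reference and force a collection pass.
        # In the test runs above this made almost no difference to
        # memory usage, which points at a leak elsewhere.
        result = None
        gc.collect()
```

If forcing collection changes nothing, the memory is still reachable somewhere (e.g. caches or per-thread state), which matches the "somewhere memory must be leaking" observation.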
100 threads are too many with 32GB: this run OOM'd and was killed.
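A bounded worker pool along these lines (a hypothetical sketch, not the crawler's actual code) is one way to cap how many mirrors are crawled concurrently, so peak memory scales with the thread count rather than the mirror count:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_mirror(host):
    # Placeholder for the real per-mirror crawl.
    return f"{host}: ok"

def crawl_with_pool(hosts, threads=50):
    # Only `threads` crawls run at once; the remaining hosts queue up.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(crawl_mirror, hosts))
```

This only bounds concurrency, of course; if per-crawl memory is leaking, a smaller pool just delays the OOM rather than preventing it.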
> I would think that increasing threads would get bogged down by either
> network access or CPUs. Since we aren't seeing more than 130% CPU
> usage, I am guessing it is bogging down on network access (e.g. it can
> only poll so many networks per second per interface and they can only
> return so quickly on that one interface). Do you think that having 2
> or more crawler systems might do better?
I was hoping to implement 2 more crawlers in the end. With a simple
setup it is possible to distribute the crawling across more machines.
We know how many mirror hosts we have, and each crawler can be given a
start and stop host ID. This distribution will not be perfect, as it
does not take into account that mirrors might be
inactive/disabled/private, but as a simple way to distribute the load
it should be good enough.
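The partitioning by host ID range described above could be sketched like this (the names and the even three-way split are illustrative assumptions, not the actual setup):

```python
def select_hosts(all_host_ids, start_id, stop_id):
    """Return the subset of host IDs this crawler instance handles."""
    return [h for h in all_host_ids if start_id <= h <= stop_id]

# Example: three crawler machines splitting host IDs 1..435.
# Inactive/disabled/private mirrors would make the real per-machine
# load uneven, as noted above.
hosts = list(range(1, 436))
machine1 = select_hosts(hosts, 1, 145)
machine2 = select_hosts(hosts, 146, 290)
machine3 = select_hosts(hosts, 291, 435)
```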
Adrian
_______________________________________________
infrastructure mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
