I would like to change the setup of our mirror crawler and just wanted
to mention my planned changes here before working on them.

Currently we have two VMs which are crawling our mirrors. Each machine
is responsible for one half of the active mirrors. The crawl
starts every 12 hours on the first crawler and 6 hours later on the
second crawler. So every 6 hours one crawler is accessing the database.

Currently most of the crawling time is spent not on crawling itself but
on updating the database with which directories are up to date on which
host. With a timeout of 4 hours per host we regularly hit that timeout
on some hosts, and most of the time the database access is the problem.

What I would like to change is to crawl each category (Fedora Linux,
Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
separately and at different times and intervals.

We would hit the timeout less often than now because only the
information for a single category has to be updated per crawl. We could
scan 'Fedora Archive' only once per day or every second day. We could
scan 'Fedora EPEL' much more often, as it is usually really fast, and
that would give us more current data about the available mirrors.
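
To make the idea more concrete, here is a rough sketch of how a
staggered per-category schedule could be laid out. The intervals and
start offsets are illustrative assumptions only, not decided values, and
the names (CATEGORY_INTERVAL_HOURS, CATEGORY_OFFSET_HOURS,
crawl_start_hours, daily_schedule) are hypothetical and not part of the
existing crawler.

```python
# Hypothetical crawl intervals in hours per category (illustrative values).
CATEGORY_INTERVAL_HOURS = {
    "Fedora Linux": 12,
    "Fedora Other": 12,
    "Fedora EPEL": 4,
    "Fedora Secondary Arches": 12,
    "Fedora Archive": 24,
}

# Hypothetical start offsets in hours, picked so that no two categories
# start in the same hour and database updates are spread over the day.
CATEGORY_OFFSET_HOURS = {
    "Fedora Linux": 0,
    "Fedora Other": 3,
    "Fedora EPEL": 1,
    "Fedora Secondary Arches": 6,
    "Fedora Archive": 10,
}


def crawl_start_hours(interval, offset, horizon=24):
    """Return the hours in [0, horizon) at which a crawl would start."""
    return list(range(offset, horizon, interval))


def daily_schedule(intervals, offsets, horizon=24):
    """Map each start hour to the list of categories starting in that hour."""
    schedule = {}
    for category, interval in intervals.items():
        for hour in crawl_start_hours(interval, offsets[category], horizon):
            schedule.setdefault(hour, []).append(category)
    return dict(sorted(schedule.items()))


if __name__ == "__main__":
    plan = daily_schedule(CATEGORY_INTERVAL_HOURS, CATEGORY_OFFSET_HOURS)
    for hour, categories in plan.items():
        print(f"{hour:02d}:00  {', '.join(categories)}")
```

With values like these every crawl only touches one category, and an
hour with more than one entry in the schedule would show directly where
two crawls compete for the database.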

My goal would be to distribute the scanning in such a way as to decrease
the load on the database and to reduce the cases of mirror
auto-deactivation due to slow database accesses.

Let me know if you think that these planned changes are the wrong
direction or if you have other ideas on how to improve the mirror crawling.

infrastructure mailing list -- infrastructure@lists.fedoraproject.org
To unsubscribe send an email to infrastructure-le...@lists.fedoraproject.org