Re: Master and mirror crawling

2015-09-15 Thread Adrian Reber
On Sat, Sep 12, 2015 at 12:52:55PM -0600, Kevin Fenzi wrote:
> > On Fri, Sep 11, 2015 at 04:56:41PM +0200, Adrian Reber wrote:
> > [...]
> > > So my main question is if we should insert a delay between umdl and
> > > the crawl of the mirrors? This would require a fedmsg emitted at
> > > the end of an umdl run and something on the crawler which waits
> > > some time before starting the crawls.
> > 
> > Thinking more about it, it actually does not make much sense to base
> > the mirror crawls on fedmsg. The mirrors are updated at (from our
> > point of view) random times. So with category based crawling we have
> > the possibility to increase the crawl frequency for Fedora Linux and
> > Fedora EPEL and decrease it for Fedora Archive. Which should
> > hopefully give MirrorManager a better view of the status of the
> > mirrors.
> 
> Well, mirrors that are using your script to trigger syncs after a
> fedmsg would be syncing right after that as well, but might depend on
> how long it takes them to sync. 

Yes, my mirror syncs from ::fedora-buffet0/ and that takes a few hours.

Adrian


___
infrastructure mailing list
infrastructure@lists.fedoraproject.org
http://lists.fedoraproject.org/postorius/infrastructure@lists.fedoraproject.org


Re: Master and mirror crawling

2015-09-12 Thread Kevin Fenzi
On Fri, 11 Sep 2015 20:42:23 +0200
Adrian Reber wrote:

> On Fri, Sep 11, 2015 at 04:56:41PM +0200, Adrian Reber wrote:
> [...]
> > So my main question is if we should insert a delay between umdl and
> > the crawl of the mirrors? This would require a fedmsg emitted at
> > the end of an umdl run and something on the crawler which waits
> > some time before starting the crawls.
> 
> Thinking more about it, it actually does not make much sense to base
> the mirror crawls on fedmsg. The mirrors are updated at (from our
> point of view) random times. So with category based crawling we have
> the possibility to increase the crawl frequency for Fedora Linux and
> Fedora EPEL and decrease it for Fedora Archive. Which should
> hopefully give MirrorManager a better view of the status of the
> mirrors.

Well, mirrors that are using your script to trigger syncs after a
fedmsg would be syncing right after that as well, but might depend on
how long it takes them to sync. 

kevin




Re: Master and mirror crawling

2015-09-11 Thread Adrian Reber
On Fri, Sep 11, 2015 at 04:56:41PM +0200, Adrian Reber wrote:
[...]
> So my main question is if we should insert a delay between umdl and the
> crawl of the mirrors? This would require a fedmsg emitted at the end of
> an umdl run and something on the crawler which waits some time before
> starting the crawls.

Thinking more about it, it actually does not make much sense to base the
mirror crawls on fedmsg. The mirrors are updated at (from our point of
view) random times. So with category based crawling we have the
possibility to increase the crawl frequency for Fedora Linux and Fedora
EPEL and decrease it for Fedora Archive. Which should hopefully give
MirrorManager a better view of the status of the mirrors.
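
A minimal sketch of what per-category crawl frequencies could look like,
assuming the crawler records when each category of a mirror was last
crawled. The interval values and the name categories_due are hypothetical
and only illustrate the idea; they are not taken from MirrorManager.

```python
from datetime import datetime, timedelta

# Hypothetical intervals: crawl fast-moving categories more often and
# the archive less often. The actual values would need to be tuned.
CRAWL_INTERVAL = {
    "Fedora Linux": timedelta(hours=4),
    "Fedora EPEL": timedelta(hours=4),
    "Fedora Archive": timedelta(hours=48),
}
# Fallback for categories not listed above (roughly twice a day).
DEFAULT_INTERVAL = timedelta(hours=12)


def categories_due(last_crawled, now):
    """Return the sorted categories of one mirror that should be crawled.

    last_crawled maps a category name to the datetime of its last crawl,
    or to None if the category has never been crawled.
    """
    due = []
    for category, last in last_crawled.items():
        interval = CRAWL_INTERVAL.get(category, DEFAULT_INTERVAL)
        if last is None or now - last >= interval:
            due.append(category)
    return sorted(due)
```

Called once per mirror on every scheduler pass, this would only select
the categories whose interval has elapsed, so a mirror carrying
everything is never crawled in one go.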

Adrian




Master and mirror crawling

2015-09-11 Thread Adrian Reber

One of my next goals to improve mirror crawling is to split the crawls
of the mirrors by category. Right now we select a mirror and crawl all
categories (Fedora Linux, Fedora EPEL, Fedora Secondary, Fedora
Archives, Fedora Other) in one go. The drawback is that it is nearly
impossible to crawl a mirror which mirrors everything within the time
limit of 3 hours. There are a few mirrors which actually mirror
everything and they are usually dropped from the mirror list because the
crawler always hits the 3 hour limit and marks the mirror as not being
up to date. The current solution is to create multiple hosts (which can
point to the same mirror) with only one or two categories. This works
but it is not the optimal solution.

Most of the time the actual scanning of the remote mirror is not the
real problem; updating the status of all those directories and files in
the local database also takes a very long time.

The master crawling by update-master-directory-list (umdl) is already
split up by category and fedmsg-driven (for most categories). So
whenever a repository is updated, umdl starts a scan and updates the
database for only the category which has changed. This works pretty well
but has the disadvantage that the database is now updated much faster,
without giving the mirrors a chance to sync before we have new
information in the database.

The reason for this long introduction is that my original plan was to
immediately start a category crawl after umdl has signalled that a certain
category has been updated in the database. This could lead to a very
short list of mirrors which are up to date and therefore I would like to
know if we should somehow introduce a delay between the time umdl has
run and the time we start to crawl the mirrors. This would give the
mirrors some time to sync the content before we crawl them.

Right now the time between the update of the master mirror and the crawl
can be anywhere between 0 and 12 hours. With a defined delay before
crawling the mirrors, this would be clearer than it is right now.

I am also hoping to be able to crawl the mirrors more often than twice a
day once we move to category-based crawls.

So my main question is if we should insert a delay between umdl and the
crawl of the mirrors? This would require a fedmsg emitted at the end of
an umdl run and something on the crawler which waits some time before
starting the crawls.
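
To make the question more concrete, here is a rough sketch of the
waiting part on the crawler side. It assumes something else receives the
end-of-run message from umdl and calls umdl_finished(); the class name,
the method names and the two hour delay are all hypothetical, and no
fedmsg API is used here.

```python
from datetime import datetime, timedelta

# Hypothetical grace period for the mirrors; no value has been decided.
SYNC_DELAY = timedelta(hours=2)


class DelayedCrawlQueue:
    """Holds categories reported by umdl until the delay has passed."""

    def __init__(self, delay=SYNC_DELAY):
        self.delay = delay
        self._due = {}  # category name -> earliest time a crawl may start

    def umdl_finished(self, category, finished_at):
        # A newer umdl run for the same category pushes the crawl back,
        # so mirrors get the full delay after the latest change.
        self._due[category] = finished_at + self.delay

    def pop_ready(self, now):
        # Return (and forget) every category whose delay has elapsed.
        ready = sorted(c for c, due in self._due.items() if due <= now)
        for category in ready:
            del self._due[category]
        return ready
```

The crawler would call pop_ready(datetime.now()) periodically and start
a category crawl for each name returned. Resetting the timer on every
umdl run is a design choice of this sketch; for a category that changes
very often it could postpone the crawl for a long time.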

Adrian

