On Fri, Jun 26, 2015 at 06:00:18PM +0000, [email protected] wrote:
> * Readable status of directories
> The Directory table has a 'readable' property, but all of our directories are
> readable.
> 
> Question is: what is the use-case for this boolean?
> 
> == MD == Pre-bitflip content, which UMDL can see but the normal public can't 
> yet.  Are you no longer bitflipping?  Then it doesn't matter.

OK, I see the use-case in the crawler, but how did it work in the UMDL?
Would the UMDL not be allowed to read a given folder?

> 
> * Changes while running
> Looking at the code, the UMDL seems to be very careful to handle changes on the
> FS while it is running.
> One hope I have is to speed up the UMDL run time, but I'm curious.
> 
> Question: Does anyone know if the FS changes often while the UMDL is actually
> running?
> Gaining speed of course does not mean being reckless, but I'm curious as to how
> often this situation occurs. IIRC, we trigger the UMDL via fedmsg now, right?
> So in theory, the FS shouldn't change too much under the UMDL's feet.
> 
> * The directory table
> So looking at the database and more precisely the directory table in that
> database, it seems we store all the directories of the tree, ie:
> /pub/alt/
> /pub/alt/anaconda/
> /pub/alt/bfo/
> /pub/alt/bfo/gpxe-20120514
> ...
> This makes me wonder: what is the point of keeping the whole list of
> directories in the DB?
> After all, as far as I understand, the UMDL finds the repos in the tree (a repo
> being defined by the presence of a 'repodata' folder containing the repomd.xml
> or by the presence of a 'summary' file and an 'objects' folder).
> For these repos, we look for the most recent files, store this info in the DB
> and later use it to check if the mirrors are up to date.
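
To make the repo definition above concrete, here is a minimal sketch in Python
of how such a detection could look. This is not the actual UMDL code: the
function names and the "yum"/"atomic" labels are made up for illustration, and
skipping the 'objects' folder is only an assumption about what would be safe.

```python
import os


def is_yum_repo(path):
    """True if `path` has a 'repodata' folder containing repomd.xml."""
    return os.path.isfile(os.path.join(path, "repodata", "repomd.xml"))


def is_atomic_repo(path):
    """True if `path` has a 'summary' file and an 'objects' folder."""
    return (os.path.isfile(os.path.join(path, "summary"))
            and os.path.isdir(os.path.join(path, "objects")))


def find_repos(top):
    """Yield (directory, kind) for every repo found under `top`."""
    for dirpath, dirnames, _filenames in os.walk(top):
        if is_yum_repo(dirpath):
            yield dirpath, "yum"
        elif is_atomic_repo(dirpath):
            yield dirpath, "atomic"
            # Assumption: there is no need to descend into the object store.
            dirnames[:] = [d for d in dirnames if d != "objects"]
```

Something like ``dict(find_repos("/pub"))`` would then give the list of repo
directories without touching the DB at all.
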
> 
> But do we need to check that ``pub/fedora/linux`` exists when we later check
> that ``pub/fedora/linux/updates/testing/21/x86_64/`` exists and is up to date?
> 
> I am currently under the impression that dropping unnecessary directories would
> save DB space (the directories then being linked in the host_category_dir table,
> which lists for each host, in each category, which dirs are present) as well as
> crawling time (both in the UMDL and in the crawler).
> 
> 
> == MD == You need non-repo directories for ISOs at least; there was a time 
> when we were able to mirror the entire Fedora static web content too; able 
> only because MM tracked all directories, not just repository directories.  
> MM1 also tried to be a "generic" mirror manager, not just a Fedora-specific 
> mirror manager, so I intentionally tracked everything, not just Yum repos.
 
Idea: what if we tracked only the folders that have files in them? For
example, http://dl.fedoraproject.org/pub/epel/5/ would then not end up in the
database.

In addition, we could add a sort of blacklist to avoid storing
http://dl.fedoraproject.org/pub/ just due to the presence of the
DIRECTORY_SIZES.txt file.

This would reduce the number of directories we store for the Atomic tree.
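
A rough sketch of what I have in mind, in Python. Nothing here comes from the
current UMDL code: the function name and the IGNORED_FILES set are hypothetical,
and DIRECTORY_SIZES.txt is just the one example I know of.

```python
import os

# Hypothetical blacklist: files whose presence alone should not cause a
# directory to be stored in the database.
IGNORED_FILES = frozenset(["DIRECTORY_SIZES.txt"])


def directories_to_track(top, ignored=IGNORED_FILES):
    """Return the sorted directories under `top` holding a non-ignored file."""
    tracked = []
    for dirpath, _dirnames, filenames in os.walk(top):
        # A directory is kept only if at least one of its own files is not
        # on the blacklist; sub-directories alone do not count.
        if any(name not in ignored for name in filenames):
            tracked.append(dirpath)
    return sorted(tracked)
```

Directories that only hold ISOs would still be tracked with this approach,
since they contain files.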

> * Non-directory based support in UMDL.
> 
> So the UMDL script currently supports three ways of crawling the tree:
> * file
> * rsync
> * directory
> 
> We, in Fedora, are only using the last one. I believe the `rsync` mode was added
> to support Ubuntu and the `file` mode is basically a simplified version of the
> directory mode, but one that we do not use at the moment.
> 
> I would like to propose that we drop support for rsync. I feel that it may be
> simpler and easier to create a UMDL and a crawler for each distro that would
> like to use MirrorManager than to maintain a one-script-fits-all UMDL that is
> in fact tested for only one of the scenarios.
> That being said, if we ever have interest from Ubuntu, CentOS or any other
> communities, we should definitely look into making the UMDL and crawler as
> re-usable as possible for them, but keeping the distro-specific bits separated.
> 
> 
> == [file] was used early on for dev and testing.  It's not interesting.  
> [rsync] would be used when you don't have access to a master mirror (or very 
> close replica).  Perhaps the rpmfusion setup still needs this.  I would have 
> for testing Ubuntu, certainly.  It shouldn't be needed for production when 
> the content being mirrored out is managed by the same people operating 
> mirrormanager, as is the Fedora case.

Apparently RPMFusion does need this, so it needs to stay. The question then
becomes: should we split the different UMDL types into different scripts?
The idea being that this would allow easier optimization.
(Note: I'm having this idea now, but since I have not yet looked at what/how we
could optimize, it may end up remaining in the same file.)


Pierre
_______________________________________________
infrastructure mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
