On Fri, Jun 26, 2015 at 06:00:18PM +0000, [email protected] wrote:
> * Readable status of directories
> The Directory table has a 'readable' property, but none of our directories
> is marked as not readable.
>
> Question is: what is the use-case for this boolean?
>
> == MD == Pre-bitflip content, which UMDL can see but the normal public can't
> yet.  Are you no longer bitflipping?  Then it doesn't matter.

Ok, I see the use-case in the crawler, but in the UMDL, how did it work? Would
the UMDL not be allowed to read a given folder?

> * Changes while running
> Looking at the code, the UMDL seems to be very careful to handle changes on
> the FS while it is running.
> One hope I have is to speed up the UMDL run time, but I'm curious.
>
> Question: Does anyone know if the FS changes often while the UMDL is actually
> running?
> Gaining speed of course does not mean being reckless, but I'm curious as to
> how often this situation occurs.
> IIRC, we trigger the UMDL via fedmsg now, right?
> So in theory, the FS shouldn't change too much under the UMDL's feet.
>
> * The directory table
> So looking at the database, and more precisely the directory table in that
> database, it seems we store all the directories of the tree, i.e.:
>   /pub/alt/
>   /pub/alt/anaconda/
>   /pub/alt/bfo/
>   /pub/alt/bfo/gpxe-20120514
>   ...
> This makes me wonder a little. What is the point of keeping the whole
> list of directories in the DB?
> After all, as far as I understand, the UMDL finds the repos in the tree (a
> repo being defined by the presence of a 'repodata' folder containing the
> repomd.xml or by the presence of a 'summary' file and an 'objects' folder).
> For these repos, we look for the most recent files, store this info in the DB
> and later use it to check if the mirrors are up to date.
>
> But do we need to check that ``pub/fedora/linux`` exists when we later
> check that ``pub/fedora/linux/updates/testing/21/x86_64/`` exists and is up
> to date?
>
> I am currently under the impression that dropping unnecessary directories
> would save DB space (the directories then being linked in the
> host_category_dir table, which lists for each host, in each category, which
> dirs are present) as well as crawling time (both in the UMDL and in the
> crawler).
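
For reference, the repo-detection rule described in the quoted text above
could be sketched roughly as follows. This is only a minimal illustration
assuming plain filesystem access; the names is_repo and find_repos are
hypothetical and are not the actual UMDL functions.

```python
import os


def is_repo(path):
    """Return True if `path` matches the repo definition quoted above.

    Hypothetical helper for illustration only, not the real UMDL code.
    """
    # First form: a 'repodata' folder containing repomd.xml.
    if os.path.isfile(os.path.join(path, "repodata", "repomd.xml")):
        return True
    # Second form: a 'summary' file and an 'objects' folder.
    return (
        os.path.isfile(os.path.join(path, "summary"))
        and os.path.isdir(os.path.join(path, "objects"))
    )


def find_repos(top):
    """Yield every directory under `top` (inclusive) that looks like a repo."""
    for dirpath, _dirnames, _filenames in os.walk(top):
        if is_repo(dirpath):
            yield dirpath
```

Something like list(find_repos("/pub")) would then return only the repo
directories, which is the set the up-to-date check actually relies on.
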
>
> == MD == You need non-repo directories for ISOs at least; there was a time
> when we were able to mirror the entire Fedora static web content too; able
> only because MM tracked all directories, not just repository directories.
> MM1 also tried to be a "generic" mirror manager, not just a Fedora-specific
> mirror manager, so I intentionally tracked everything, not just Yum repos.

Idea: what if we tracked only the folders that have files in them, so that,
for example, http://dl.fedoraproject.org/pub/epel/5/ would not end up in the
database?
In addition, we could add a sort of blacklist to avoid storing
http://dl.fedoraproject.org/pub/ just because of the presence of the
DIRECTORY_SIZES.txt file.
This would reduce the number of directories we store for the Atomic tree.

> * Non-directory based support in UMDL.
>
> So the UMDL script currently supports three ways of crawling the tree:
> * file
> * rsync
> * directory
>
> We, in Fedora, are only using the last one. I believe the `rsync` mode was
> added to support Ubuntu and the file mode is basically a simplified version
> of the directory mode, but one that we do not use at the moment.
>
> I would like to propose that we drop support for rsync. I feel that it may be
> simpler and easier to create a UMDL and a crawler for each distro that would
> like to use MirrorManager than to maintain a one-script-fits-all UMDL that is
> in fact tested for only one of the scenarios.
> That being said, if we ever have interest from Ubuntu, CentOS or any other
> communities, we should definitely look into making the UMDL and crawler as
> re-usable as possible for them, while keeping the distro-specific bits
> separated.
>
> == [file] was used early on for dev and testing.  It's not interesting.
> [rsync] would be used when you don't have access to a master mirror (or very
> close replica).  Perhaps the rpmfusion setup still needs this.  I would have
> for testing Ubuntu, certainly.
> It shouldn't be needed for production when the content being mirrored out is
> managed by the same people operating mirrormanager, as is the Fedora case.

Apparently RPMFusion does need this, so it needs to stay. The question then
becomes: should we split the different UMDL types into different scripts? The
idea being that this would allow easier optimization.
(Note: I'm having this idea now, but since I have not looked at what/how we
could optimize, it may end up remaining in the same file.)

Pierre
_______________________________________________
infrastructure mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
