Hello maintainers,

Here's a small-to-medium-sized coding project that we would benefit from significantly. It's nicely self-contained and has some algorithmic components. It can be written in any programming language that we can run on our buildfarm.
Our current catalog generation takes about 80 minutes, and we currently run it every 3h. If we can make it faster, we can generate the catalog more often, get a quicker build-push-release turnaround, and relieve the buildfarm of most of the disk stress that catalog generation currently causes. If we can make the generation complete in something like 10 minutes (which I think is possible), we could run it e.g. every hour.

We have a directory on disk with a package catalog, as we can see on the mirror:

    http://mirror.opencsw.org/opencsw/unstable/i386/5.10/

We can query the RESTful interface for the current state of the same catalog in the database:

    curl -s http://buildfarm.opencsw.org/pkgdb/rest/catalogs/unstable/i386/SunOS5.10/ \
      | python -m json.tool | (head -n 30; cat >/dev/null)
    [
        {
            "basename": "389_admin-1.1.30,REV=2013.01.07-SunOS5.10-i386-CSW.pkg.gz",
            "catalogname": "389_admin",
            "file_basename": "389_admin-1.1.30,REV=2013.01.07-SunOS5.10-i386-CSW.pkg.gz",
            "md5_sum": "6110aad210240504ede48f9cd8b4501c",
            "mtime": "2013-01-07T12:02:22",
            "rev": "2013.01.07",
            "size": 403046,
            "version": "1.1.30,REV=2013.01.07",
            "version_string": "1.1.30,REV=2013.01.07"
        },
        (...)
    ]

(The query takes about 25s to evaluate; the python bit is there just to pretty-print the data.)

We also have the 'allpkgs' directory:

    http://mirror.opencsw.org/opencsw/allpkgs/

It's excluded from rsync, so it doesn't get propagated to mirrors, but it does exist on the master mirror and the buildfarm. It's the central pool for all the package data files.

When we generate catalogs, we do not copy anything; instead, we make hardlinks to the files in the allpkgs directory. For example, we hardlink allpkgs/foo-i386-CSW.pkg.gz into unstable/i386/5.9. However, when we generate a catalog for the next OS release (e.g. 5.10), we do not make another hardlink; if possible, we make a symlink from the 5.10 directory to the file in the 5.9 directory.
This way we save space on mirrors: we only send out 1 copy of the file (in the lowest OS release in which it occurs), and then we create symlinks to it. For example:

    allpkgs/foo-i386-CSW.pkg.gz             (not synced to mirrors)
    unstable/i386/5.9/foo-i386-CSW.pkg.gz   (hardlink to the file in allpkgs)
    unstable/i386/5.10/foo-i386-CSW.pkg.gz  → ../5.9/foo-i386-CSW.pkg.gz (symlink)
    unstable/i386/5.11/foo-i386-CSW.pkg.gz  → ../5.9/foo-i386-CSW.pkg.gz (symlink)

You can now see that we need to generate catalogs for one catalog release (e.g. unstable), one architecture, and all OS releases in a single program run.

We do currently have code that does it, but the code is really stupid. It unlinks everything from the directory and starts from scratch every time. This generates a lot of unnecessary disk operations and makes the whole process slow. It would be much better to see what's in the database, see what's on disk, and figure out the smallest set of operations to bring the disk to the new state.

Would anyone be up for writing it?

Maciej
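To make the "compare the database with the disk" idea concrete, here is a minimal Python sketch, not the existing buildfarm code. All function names are hypothetical. It assumes the per-OS-release lists of package basenames have already been fetched (e.g. from the "basename" field of the REST response), that OS releases are passed in order from lowest to highest, and that package files can be recognised by a .pkg or .pkg.gz suffix, so that other files in the catalog directories are left alone.

```python
import os
from typing import Dict, Iterable, List, Optional, Tuple

Entry = Tuple[str, str]     # ("hardlink", basename in allpkgs) or ("symlink", target)
Op = Tuple[str, str, str]   # (kind, path relative to catalog root, argument)
PKG_SUFFIXES = (".pkg", ".pkg.gz")  # assumption: how package files are recognised


def desired_layout(catalogs: Dict[str, Iterable[str]],
                   os_releases: List[str]) -> Dict[str, Entry]:
    """Hardlink in the lowest OS release containing a file, symlinks above it."""
    layout: Dict[str, Entry] = {}
    first_seen: Dict[str, str] = {}  # basename -> lowest OS release that has it
    for osrel in os_releases:        # must be ordered lowest to highest
        for base in catalogs.get(osrel, ()):
            rel = f"{osrel}/{base}"
            if base in first_seen:
                layout[rel] = ("symlink", f"../{first_seen[base]}/{base}")
            else:
                first_seen[base] = osrel
                layout[rel] = ("hardlink", base)
    return layout


def actual_layout(root: str, allpkgs: str,
                  os_releases: List[str]) -> Dict[str, Entry]:
    """Describe what is on disk in the same terms as desired_layout()."""
    layout: Dict[str, Entry] = {}
    for osrel in os_releases:
        directory = os.path.join(root, osrel)
        if not os.path.isdir(directory):
            continue
        for name in os.listdir(directory):
            if not name.endswith(PKG_SUFFIXES):
                continue  # skip anything that is not a package file
            path = os.path.join(directory, name)
            if os.path.islink(path):
                layout[f"{osrel}/{name}"] = ("symlink", os.readlink(path))
            else:
                pool = os.path.join(allpkgs, name)
                same = os.path.exists(pool) and os.path.samefile(path, pool)
                # "other" never matches a desired entry, so it gets replaced.
                layout[f"{osrel}/{name}"] = ("hardlink" if same else "other", name)
    return layout


def plan(desired: Dict[str, Entry], actual: Dict[str, Entry]) -> List[Op]:
    """Return only the operations needed to turn `actual` into `desired`."""
    unlinks: List[Op] = []
    hardlinks: List[Op] = []
    symlinks: List[Op] = []
    for rel in sorted(desired.keys() | actual.keys()):
        want: Optional[Entry] = desired.get(rel)
        have: Optional[Entry] = actual.get(rel)
        if want == have:
            continue  # already correct: no disk operation at all
        if have is not None:
            unlinks.append(("unlink", rel, ""))
        if want is not None:
            target = hardlinks if want[0] == "hardlink" else symlinks
            target.append((want[0], rel, want[1]))
    # Remove first, then create hardlinks, then the symlinks that point at them.
    return unlinks + hardlinks + symlinks


def apply_plan(ops: List[Op], root: str, allpkgs: str) -> None:
    """Execute the planned operations under `root`."""
    for kind, rel, arg in ops:
        path = os.path.join(root, rel)
        if kind == "unlink":
            os.unlink(path)
        elif kind == "hardlink":
            os.link(os.path.join(allpkgs, arg), path)
        else:
            os.symlink(arg, path)
```

With something like this, a run in which nothing changed produces an empty plan and touches no files, and a run in which a handful of packages changed only performs operations for those packages. The per-OS-release catalog index files would still need regenerating separately; that part is not covered here.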
