Hey,

This follows on from the email I sent back in February [1].

1: https://lists.gnu.org/archive/html/guix-devel/2020-02/msg00268.html

As it turns out, quite a lot has happened over the last month and a bit! In summary, this email talks about:

 - Providing database dumps, and how this works
 - Loading new revisions should now be much faster
 - A performance issue with the links of the package reproducibility page has been fixed
 - Data about builds and substitutes is more up to date
 - The Guix Data Service now runs on Guile 3
 - System test derivations are now computed for multiple systems
 - You can view package history by "output" now, as well as version and derivation
 - I'm no longer the only person making code changes!

There's now a page [2] that lists dumps of the database. Previously this was just nginx's listing of the /var/lib/guix-data-service/dumps directory. Creating new dumps was a manual process, but there's now a mcron job on the machine that takes care of this, so new data should appear daily.

2: http://data.guix.gnu.org/dumps/

The Hetzner server on which data.guix.gnu.org is hosted only has 150GB of disk space for the database and store, with 73GB currently being taken up by the database. This didn't leave any space to store dumps, let alone generate the small dumps, which require restoring a copy of the database so that it can be modified.

I added a 100GB volume to the server, which acts as temporary space where the dumps can be stored and the small dumps can be created.

For actually storing the dumps longer term, I'm using a combination of git-annex [3] and a file storage service called Wasabi [4]. I didn't want to write backup code that only worked with Wasabi, so the idea of using git-annex as well is that it deals with the details of how to move the files around. I picked Wasabi because the storage is quite cheap, and it doesn't charge for serving the files. As with the server, I'm currently paying for this.
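
To make the disk space problem described above a bit more concrete, here's a rough back-of-the-envelope sketch in Python. It uses the sizes mentioned in this email, but it assumes that a restored copy of the database takes about as much space as the live one, and it ignores whatever the store uses. Both of those are simplifications on my part, and the names are made up for illustration.

```python
# Figures from the email, in GB.
DISK_TOTAL = 150      # shared by the database and the store
DATABASE = 73         # current size of the live database
EXTRA_VOLUME = 100    # volume added for working with dumps

# Assumption (not a measured figure): a restored copy needs about as much
# space as the live database. The store's usage is treated as zero here.
RESTORED_COPY = DATABASE


def headroom(capacity_gb, *used_gb):
    """Return the space left on a disk after the given usages, in GB."""
    return capacity_gb - sum(used_gb)


# Restoring a copy next to the live database, on the original disk.
on_original_disk = headroom(DISK_TOTAL, DATABASE, RESTORED_COPY)

# Restoring the copy on the extra volume instead.
on_extra_volume = headroom(EXTRA_VOLUME, RESTORED_COPY)

print(f"original disk: {on_original_disk} GB left")
print(f"extra volume:  {on_extra_volume} GB left")
```

Under those assumptions the original disk would have almost nothing left even before counting the store or the dump files themselves, while the extra volume leaves some room to work in.
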

3: https://git-annex.branchable.com/
4: https://wasabi.com/

This should mean that backups are regularly available, which is convenient. Also, the small backup has been improved over the last month: it's now small again (down from ~10GB for 2020-03-13 to 0.7GB for 2020-03-28) and it now includes data for system tests and channel instances.

I didn't test whether Guile 3 had any impact on performance, but there have been some data loading performance improvements over the last month. The channel instance locking was improved, so more can be done in parallel. Building on some changes in Guix for the derivation linter, the Guix Data Service can now pass in a store connection to be used, which also makes loading new revisions a little faster. I also looked into the very slow loading of package metadata [5]. This could take ~30 minutes previously, but I've now seen it happen in as little as 3 seconds!

5: Look for "debug: Finished querying the temp_package_metadata" in the job output

It's not about the performance of loading data for new revisions, but I also looked into why the links on the package reproducibility page for a revision ([6] for example) would time out. This turned out to be an easy fix: just adding a database index in the right place. While the lack of data about builds is still a limiting factor, this page [6] should be a bit more useful and usable.

6: http://data.guix.gnu.org/revision/8f83699ba00743d258b497e0e5285989996ee559/package-reproducibility

I also spent some time debugging why the script for querying build servers would hang or break when run for long periods. I think this was resolved with some tweaks to http-get-multiple [7], and now I've been able to leave the script running. This should have two positive effects. Firstly, the build and narinfo information on data.guix.gnu.org should be more up to date than it was previously.
Secondly, because the Guix Data Service is regularly querying for narinfo files, including for new derivations, this'll prompt guix-publish to bake nar files for these outputs, hopefully soon after the output has been generated.

7: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=39873

The Guix Data Service now works with Guile 3, and the Guix package has been changed to use Guile 3.

System test derivations are now generated for multiple systems [8].

8: http://data.guix.gnu.org/revision/8b87d095b39dee91056b88f96b374faa8c3a8891/system-tests

Previously you could view the version or derivation history for a package on a branch, but now you can view the "output" history as well [9]. This is in some ways more useful than the derivation history, as there you get more entries due to changes in fixed output generations.

9: http://data.guix.gnu.org/repository/1/branch/master/package/libreoffice/output-history

I'm also now not the only one to have worked on the Guix Data Service [10]. This is a positive sign for the "Improve internationalization support for the Guix Data Service" Outreachy project. Providing there's a successful applicant, I believe that'll be announced on the 27th of April.

10: https://git.savannah.gnu.org/cgit/guix/data-service.git/commit/?id=f980b6c2acd4388627b5abb30bdf98fcbb18fb7f

Looking forward, I'd still like to see loading data be faster. One thing I might try is parallelising parts like computing the channel instance derivations and running the lint checkers. I'd also like to make some sort of sitemap to make the pages more discoverable.

Hopefully though, it's getting towards the point where the Guix Data Service can start being used as something to build upon, which is the way I've been thinking about it. By making data about Guix available in this format, it should be easier to build new and exciting tools and services.

Just let me know if you have any comments or questions!

Chris
