Re: sorting yum/dnf metadata and metadata diffs
On 13.02.2015 08:11, Casey Jao wrote:
> How feasible would it be to keep the listings in primary.xml and filelists.xml sorted by package name and arch? Doing so could open the door to simple and efficient diffs of repository metadata.

Something like pdiffs in Debian?

> Those two are by far the largest metadata files. If the observed improvements are typical, then keeping those files in order and hosting the diffs between the present and the previous few days (and modifying dnf to look for those diffs) could substantially reduce the amount of data that users must download every time a repository is updated, which for a fast-moving OS like Fedora could happen nearly every day.

If only the amount of downloaded data matters, then why not compress primary.xml and filelists.xml with xz?

  11646147 primary.xml.gz
   8676976 primary.xml.xz
  30607019 filelists.xml.gz
  23661236 filelists.xml.xz

But yeah, it can make dnf/yum use more CPU power to uncompress them each time they want to use that data.

--
devel mailing list
devel@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/devel
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct
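For anyone who wants to repeat the gz-versus-xz size comparison from the message above on their own metadata, here is a minimal Python sketch. It uses only the standard `gzip` and `lzma` modules. The `sample` data is a synthetic stand-in rather than real repository metadata, and `compressed_sizes` is a helper name made up for this illustration, so the numbers it prints say nothing about an actual primary.xml.

```python
import gzip
import lzma


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes under gzip and xz."""
    return {
        "gz": len(gzip.compress(data, compresslevel=9)),
        "xz": len(lzma.compress(data, preset=9)),
    }


# Synthetic stand-in for metadata (an assumption for this sketch).
# To measure a real file, read its uncompressed bytes instead.
sample = b"".join(
    b'<package type="rpm"><name>pkg%d</name><arch>x86_64</arch></package>\n' % i
    for i in range(5000)
)

sizes = compressed_sizes(sample)
print("raw:", len(sample), "gz:", sizes["gz"], "xz:", sizes["xz"])
```

Timing `lzma.decompress` against `gzip.decompress` on the same data would be the natural way to look at the CPU cost mentioned at the end of the message.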
Re: sorting yum/dnf metadata and metadata diffs
Hi,

there's been some work in progress already:
https://bugzilla.redhat.com/show_bug.cgi?id=850896

Proof-of-concept code (to be merged into dnf/createrepo_c in the future):
https://github.com/Tojaj/DeltaRepo

The idea behind that is simple:
* create deltas as small repos on the server
* download deltas on the client
* do an in-memory mergerepo on the client (or cache it on disk if it makes sense)

I consider this approach better than making diffs, especially because it's simple and clean, and it can work with any repo format (sqlite, xml or a mix of both).

- daniel

On 13.2.2015 at 08:11, Casey Jao wrote:
> How feasible would it be to keep the listings in primary.xml and filelists.xml sorted by package name and arch? Doing so could open the door to simple and efficient diffs of repository metadata.
>
> I recently ran some quick tests using python and elementtree. While the F21 primary.xml files from 2/7 and 2/9 both weigh around 2.6M compressed and ~18M uncompressed, sorting them and running a simple line-by-line comparison revealed a diff of ~500K, which compressed down to ~70K. A similar procedure on the 8M filelists.xml yielded a diff which compressed to ~200K.
>
> Those two are by far the largest metadata files. If the observed improvements are typical, then keeping those files in order and hosting the diffs between the present and the previous few days (and modifying dnf to look for those diffs) could substantially reduce the amount of data that users must download every time a repository is updated, which for a fast-moving OS like Fedora could happen nearly every day.

--
Daniel Mach dm...@redhat.com
Release Engineering, Red Hat
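The delta-repo idea described in this message (create a delta on the server, download it, merge it in memory on the client) can be sketched in a few lines of Python. This is only an illustration of the general technique and not DeltaRepo's actual format or code: the dict-based repo layout, the `name.arch` keys, and the `make_delta` / `apply_delta` names are all assumptions made for this example.

```python
def make_delta(old: dict, new: dict) -> dict:
    """Server side: record what was added or changed and what was removed."""
    return {
        "added": {key: pkg for key, pkg in new.items() if old.get(key) != pkg},
        "removed": sorted(key for key in old if key not in new),
    }


def apply_delta(old: dict, delta: dict) -> dict:
    """Client side: merge a downloaded delta into the cached repo in memory."""
    removed = set(delta["removed"])
    merged = {key: pkg for key, pkg in old.items() if key not in removed}
    merged.update(delta["added"])
    return merged


# Made-up package data, keyed by a hypothetical "name.arch" identifier.
old_repo = {"acl.x86_64": "1.0-1", "bash.x86_64": "1.0-1", "zsh.x86_64": "1.0-1"}
new_repo = {"acl.x86_64": "1.0-1", "bash.x86_64": "1.1-1", "vim.x86_64": "1.0-1"}

delta = make_delta(old_repo, new_repo)
merged_repo = apply_delta(old_repo, delta)
print(delta)
print(merged_repo == new_repo)
```

Because the delta is expressed in terms of package records rather than text lines, the same merge step works regardless of whether the records were originally read from sqlite or xml.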
Re: sorting yum/dnf metadata and metadata diffs
> How feasible would it be to keep the listings in primary.xml and filelists.xml sorted by package name and arch? Doing so could open the door to simple and efficient diffs of repository metadata.

Createrepo_c [1] keeps packages sorted by filename [2] by default. Sorting by filename was chosen intentionally: a package basename usually consists of name-version-release.arch, so the ordering is more deterministic than sorting by name and arch alone.

Tomas

[1] https://github.com/Tojaj/createrepo_c
[2] https://github.com/Tojaj/createrepo_c/blob/master/src/createrepo_c.c#L111
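To illustrate the determinism point in this message, here is a small Python sketch comparing the two sort keys. It is not createrepo_c's code (that is C); the package records, the `location` field, and the helper names are made up for the example. When two builds of the same package share a name and arch, a (name, arch) key leaves their relative order up to the input order, while the filename key does not.

```python
import os

# Made-up records: two builds of "bash" with the same name and arch.
packages = [
    {"name": "bash", "arch": "x86_64", "location": "Packages/b/bash-2.0-1.x86_64.rpm"},
    {"name": "bash", "arch": "x86_64", "location": "Packages/b/bash-1.0-1.x86_64.rpm"},
    {"name": "acl", "arch": "x86_64", "location": "Packages/a/acl-1.0-1.x86_64.rpm"},
]


def by_name_arch(pkg):
    return (pkg["name"], pkg["arch"])


def by_filename(pkg):
    # Plain string comparison of the basename, not RPM version comparison.
    return os.path.basename(pkg["location"])


def filenames(pkgs):
    return [os.path.basename(p["location"]) for p in pkgs]


forward, backward = packages, list(reversed(packages))

# Ties under (name, arch) keep their input order, so the result varies.
print(filenames(sorted(forward, key=by_name_arch)))
print(filenames(sorted(backward, key=by_name_arch)))

# The filename key gives one ordering regardless of input order.
print(filenames(sorted(forward, key=by_filename)))
print(filenames(sorted(backward, key=by_filename)))
```

A stable, input-independent order is what keeps line-by-line diffs between two metadata snapshots small.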
sorting yum/dnf metadata and metadata diffs
How feasible would it be to keep the listings in primary.xml and filelists.xml sorted by package name and arch? Doing so could open the door to simple and efficient diffs of repository metadata.

I recently ran some quick tests using Python and ElementTree. While the F21 primary.xml files from 2/7 and 2/9 both weigh around 2.6M compressed and ~18M uncompressed, sorting them and running a simple line-by-line comparison revealed a diff of ~500K, which compressed down to ~70K. A similar procedure on the 8M filelists.xml yielded a diff which compressed to ~200K.

Those two are by far the largest metadata files. If the observed improvements are typical, then keeping those files in order and hosting the diffs between the present and the previous few days (and modifying dnf to look for those diffs) could substantially reduce the amount of data that users must download every time a repository is updated, which for a fast-moving OS like Fedora could happen nearly every day.
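The test described in this message (sort the package entries by name and arch, then compare line by line) could look roughly like the sketch below. It is a reconstruction under assumptions, not the script that produced the numbers quoted here: the two inline documents are tiny made-up stand-ins for primary.xml, `sorted_package_lines` is a hypothetical helper, and the namespace URI is the one primary.xml is understood to use for its `<package>` elements.

```python
import difflib
import xml.etree.ElementTree as ET

# Namespace assumed for <metadata>/<package> elements in primary.xml.
NS = {"c": "http://linux.duke.edu/metadata/common"}


def sorted_package_lines(xml_text):
    """Serialize each <package> onto one line, ordered by (name, arch)."""
    root = ET.fromstring(xml_text)
    packages = root.findall("c:package", NS)
    packages.sort(
        key=lambda p: (p.findtext("c:name", "", NS), p.findtext("c:arch", "", NS))
    )
    return [
        ET.tostring(p, encoding="unicode").strip().replace("\n", " ")
        for p in packages
    ]


# Made-up snapshots: same packages in a different order, one version bump.
OLD = """<metadata xmlns="http://linux.duke.edu/metadata/common">
<package type="rpm"><name>zsh</name><arch>x86_64</arch><version ver="1.0"/></package>
<package type="rpm"><name>bash</name><arch>x86_64</arch><version ver="1.0"/></package>
</metadata>"""

NEW = """<metadata xmlns="http://linux.duke.edu/metadata/common">
<package type="rpm"><name>bash</name><arch>x86_64</arch><version ver="1.1"/></package>
<package type="rpm"><name>zsh</name><arch>x86_64</arch><version ver="1.0"/></package>
</metadata>"""

diff = list(
    difflib.unified_diff(
        sorted_package_lines(OLD),
        sorted_package_lines(NEW),
        "old",
        "new",
        n=0,
        lineterm="",
    )
)
print("\n".join(diff))
```

Without the sort, the reordering alone would show up in the diff; with it, only the changed bash entry does. For the real 18M files, streaming with `ET.iterparse` would probably be kinder to memory than `ET.fromstring`.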