Jonathan,

Have you experimented with casync?
Vít

https://github.com/systemd/casync

On 13. 2. 2018 at 10:52, Igor Gnatenko wrote:
> CCing rpm-ecosystem@ ML since it's the main location where this message
> should have gone 😉
>
> On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
> > <tl;dr>
> > I've come up with a method of splitting repodata into chunks that can
> > be downloaded and combined with chunks that are already on the local
> > system to create a byte-for-byte copy of the compressed repodata.
> > Tools and scripts are at: https://www.jdieter.net/downloads/
> > </tl;dr>
> >
> > Background:
> > With DNF, we're currently downloading ~20MB of repository data every
> > time the updates repository changes.
> >
> > When casync was released, I wondered if we could use it to only
> > download the deltas for the repodata. At Flock last summer, I ran some
> > tests against the uncompressed repodata and saw a reduction of 30-40%
> > from one day to the next, which seemed low, but was a good starting
> > point.
> >
> > Unfortunately, due to the way casync separates each file into thousands
> > of compressed chunks, building each file required thousands of (serial)
> > downloads which, even on a decent internet connection, took *forever*.
> >
> > When I talked through the idea with Kevin and Patrick, they also
> > pointed out that our mirrors might not be too keen on the idea of
> > adding thousands of tiny files that change every day.
> >
> > The Solution(?):
> > One potential solution to the "multitude of files" problem is to merge
> > the chunks back into a single file and use HTTP ranges to only
> > download the parts of the file we want. An added bonus is that most
> > web servers are configured to support hundreds of ranges in one
> > request, which greatly reduces the number of requests we have to make.
> >
> > The other problem with casync is that its chunk separation is naïve,
> > which is why we were only achieving 30-40% savings. But we know what
> > the XML file is supposed to look like, so we can separate the chunks on
> > the tag boundaries in the XML.
> >
> > So I've ditched casync altogether and put together a proof-of-concept
> > (tentatively named zchunk) that takes an XML file, compresses each tag
> > separately, and then concatenates all of them into one file. The tool
> > also creates an index file that tells you the sha256sum for each
> > compressed chunk and the location of the chunk in the file.
> >
> > I've also written a small script that will download a zchunk off the
> > internet. If you don't specify an old file, it will just download
> > everything, but if you specify an old file, it will download the index
> > of the new file and compare the sha256sums of each chunk. Any
> > checksums that match will be taken from the old file, and the rest will
> > be downloaded.
> >
> > In testing, I've seen savings ranging from 10% (December 17 to today)
> > to 95% (yesterday to today).
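To make the reuse-vs-download step concrete, here is a minimal Python sketch.
It is not Jonathan's actual dl_zchunk.py: the (sha256, offset, length) index
layout and the one-Range-request-per-missing-chunk loop are simplifying
assumptions, since the real scheme batches hundreds of ranges into a single
request.

# A minimal sketch of the reuse-vs-download step, NOT Jonathan's dl_zchunk.py.
# Assumption: each index is a list of (sha256_hex, offset, length) tuples
# describing the compressed chunks in file order.
import requests  # third-party HTTP client, used here only for illustration


def rebuild(url, new_index, old_index, old_path):
    """Rebuild the new compressed file, reusing matching chunks from the old one."""
    old_by_sum = {chksum: (off, length) for chksum, off, length in old_index}
    with open(old_path, "rb") as f:
        old_data = f.read()

    out, reused, fetched = bytearray(), 0, 0
    for chksum, off, length in new_index:
        if chksum in old_by_sum:                 # chunk unchanged, copy it locally
            old_off, old_len = old_by_sum[chksum]
            out += old_data[old_off:old_off + old_len]
            reused += old_len
        else:                                    # download just this byte range
            hdr = {"Range": "bytes=%d-%d" % (off, off + length - 1)}
            out += requests.get(url, headers=hdr).content
            fetched += length
    print("reused %d bytes, downloaded %d bytes" % (reused, fetched))
    return bytes(out)

Because every chunk is compressed independently, concatenating the reused and
freshly downloaded pieces reproduces the new compressed file byte for byte.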
> >
> > Remaining problems:
> > * Zchunk files are bigger than their gzip equivalents. This ranges
> >   from 5% larger for filelists.xml to 300% larger for primary.xml.
> >   This can be greatly reduced by chunking primary.xml based on srpm
> >   rather than rpm, which brings the size increase for primary.xml down
> >   to roughly 30%.
> >
> > * Many changes to the metadata can mean a large number of ranges
> >   requested. I ran a check on our mirrors, and three (out of around
> >   150 that had the file I was testing) don't honor range requests at
> >   all, and three others only honor a small number in a single request.
> >   A further seven didn't respond at all (not sure if that had anything
> >   to do with the range requests), and the rest supported between 256
> >   and 512 ranges in a single request. We can reduce the number of
> >   ranges requested by always ordering our packages by date. This would
> >   ensure that new packages are grouped at the end of the xml where
> >   they will be grabbed in one contiguous range.
>
> This would "break" DNF, because libsolv assigns Ids by the order of
> packages in the metadata. So if something requires "webserver" and there
> are "nginx" and "httpd" providing it (without versions), then the lowest
> Id is picked (not going into details of this). Which means that,
> depending on when the last update for one or the other was submitted,
> users will get different results. This is unacceptable from my POV.
>
> > * Zchunk files use zlib (it gives better compression than xz with such
> >   small chunks), but, because they use a custom zdict, they are not gz
> >   files. This means that we'll need new tools to read and write them.
> >   (And I am volunteering to do the work here)
>
> What about zstd? Also, the latest version of lz4 has support for
> dictionaries too.
>
> > The tools:
> > The proof-of-concept tools are all sitting in
> > https://www.jdieter.net/downloads/zchunk-scripts/
> >
> > They are full of ugly hacks, especially when it comes to parsing the
> > XML, there's little to no error reporting, and I didn't comment them
> > well at all, but they should work.
> >
> > If all you want to do is download zchunks, you need to run dl_zchunk.py
> > with the url you want to download (ending in .zck) as the first
> > parameter. Repodata for various days over the last few weeks is at:
> > https://www.jdieter.net/downloads/zchunk-test/ You may need to hover
> > over the links to see which is which. The downloads directory is also
> > available over rsync at rsync://jdieter.net/downloads/zchunk-test.
> >
> > dl_zchunk.py doesn't show anything if you download the full file, but
> > if you run the command with an old file as the second parameter, it
> > will show four numbers: bytes taken from the old file, bytes downloaded
> > from the new, total downloaded bytes and total uploaded bytes.
> >
> > zchunk.py creates a .zck file. To group chunks by source rpm in
> > primary.xml, run
> >     ./zchunk.py <file> rpm:sourcerpm
> >
> > unzchunk.py decompresses a .zck file to stdout.
> >
> > I realize that there's a lot to digest here, and it's late, so I know I
> > missed something. Please let me know if you have any suggestions,
> > criticisms or flames, though it might be a few hours before I respond.
>
> As someone who has tried to work on this problem, I very much appreciate
> what you have done here. We started with zsync and the results were
> quite good, but zsync is dead and has a ton of bugs. Also, it requires
> archives to be `--rsyncable`. So my question is: why not add the idx
> file as an additional one alongside the existing files instead of
> inventing a new format? The problem is that we will have to distribute
> the old format too (for compatibility reasons).
>
> I'm not sure that trying to do optimizations by XML tags is a very good
> idea, especially because I hope that in the future we will stop
> distributing XMLs and start distributing solv/solvx.
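On the custom-zdict point above, a rough, self-contained illustration of why
per-chunk compression against a preset dictionary cannot be read by plain gzip
tools. This is not the actual zchunk.py, and the dictionary here is a naive
stand-in for whatever the real tool builds.

# Rough illustration only: each chunk is its own zlib stream compressed
# against a shared preset dictionary, then the streams are concatenated.
import hashlib
import zlib


def compress_chunks(chunks, zdict):
    """Compress each chunk separately and build a (sha256, offset, length) index."""
    blob, index, offset = bytearray(), [], 0
    for chunk in chunks:
        comp = zlib.compressobj(level=9, zdict=zdict)  # fresh stream per chunk
        data = comp.compress(chunk) + comp.flush()
        index.append((hashlib.sha256(data).hexdigest(), offset, len(data)))
        blob += data
        offset += len(data)
    return bytes(blob), index


def decompress_chunk(blob, entry, zdict):
    """Decompress one chunk; without the same zdict this fails, unlike a .gz file."""
    _, offset, length = entry
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob[offset:offset + length])


# Example: in primary.xml each chunk might be one <package> element (or one
# source rpm's worth of packages).
chunks = [b"<package>nginx ...</package>", b"<package>httpd ...</package>"]
zdict = b"".join(chunks)          # naive stand-in for a real shared dictionary
blob, index = compress_chunks(chunks, zdict)
assert decompress_chunk(blob, index[1], zdict) == chunks[1]

A standard gunzip has no way to supply that dictionary (and the streams lack
gzip headers), which is the compatibility cost Jonathan mentions and why new
read/write tools would be needed.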
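And on the "chunk primary.xml by source rpm" idea, a short sketch of grouping
<package> entries by rpm:sourcerpm so that all subpackages of one source rpm
land in the same chunk. Again my own illustration (using the standard
primary.xml namespaces), not Jonathan's parser.

# Group primary.xml packages by source rpm; error handling omitted.
import xml.etree.ElementTree as ET

NS = {
    "c": "http://linux.duke.edu/metadata/common",   # primary.xml default namespace
    "rpm": "http://linux.duke.edu/metadata/rpm",
}


def chunks_by_sourcerpm(primary_xml_path):
    """Return one bytes blob of serialized <package> elements per source rpm."""
    groups = {}
    root = ET.parse(primary_xml_path).getroot()
    for pkg in root.findall("c:package", NS):
        srpm = pkg.findtext("c:format/rpm:sourcerpm", default="", namespaces=NS)
        groups.setdefault(srpm, []).append(ET.tostring(pkg))
    # Each group becomes one compressed chunk; rebuilding every subpackage of a
    # source rpm then only invalidates a single chunk in the .zck file.
    return [b"".join(pkgs) for pkgs in groups.values()]

Grouping by rpm:sourcerpm rather than per package is what brings the
primary.xml size overhead down to roughly 30% in Jonathan's numbers.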