Re: proposal for a more efficient download process
* curt manucredo (hansycm) [Fri, May 26 2006, 07:53:58PM]:
> this can lead to this:
> --
> if a patch is available:
>
> 1. look in /var/cache/apt/packages for the package to be updated. if the
>    old one is there, patch its files. md5sum. happy? if not...
>
> 2. try to repack the package with dpkg-repack. patch the files. md5sum.
>    if no success...

Repacking may suck, as pointed out by others. Why not just modify dpkg?

I imagine a new kind of "package-diff" package, containing diffs instead
of real files, e.g. for the last n versions of a package (or for m
versions in a certain time frame). This way files would be created
smoothly on the fly.

Of course, the special dpkg program would use debsums first to check the
integrity of the installed package contents before trying to patch them.
And of course it would only continue after the contents have been copied
and patched successfully in a separate location.

Eduard.

--
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]
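Eduard's check-copy-patch-verify sequence can be sketched as follows (a minimal illustration in Python, not actual dpkg or debsums code; `patch_file`, its `apply_patch` callable, and the md5 bookkeeping are all assumptions made for the sketch):

```python
import hashlib
import os
import shutil

def md5(path):
    """Hex md5 of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def patch_file(installed, old_md5, new_md5, apply_patch, workdir):
    """Check the installed file's integrity first (as debsums would),
    copy it to a separate location, patch the copy there, and verify
    the result before handing it back. apply_patch is a hypothetical
    patch routine supplied by the caller."""
    if md5(installed) != old_md5:
        raise ValueError("installed file was modified; need a full download")
    staged = os.path.join(workdir, os.path.basename(installed))
    shutil.copy(installed, staged)
    apply_patch(staged)
    if md5(staged) != new_md5:
        raise ValueError("patch result mismatch; need a full download")
    return staged
```

The point of the staging directory is exactly Eduard's: the installed file is never touched until the patched copy has verified.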
Re: proposal for a more efficient download process
A Mennucc <[EMAIL PROTECTED]> wrote:

> Absolutely true. Look at this
>
> $ ls -s tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb
> 42388 tetex-doc_3.0-18_all.deb  42340 tetex-doc_3.0-17_all.deb
>
> $ bsdiff tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb brutal.bsdiff
> $ ls -s brutal.bsdiff
> 10092 brutal.bsdiff
>
> Hat tip to 'bsdiff', but we can do better...
>
> $ ar p tetex-doc_3.0-17_all.deb data.tar.gz | zcat > /tmp/17.tar
> $ ar p tetex-doc_3.0-18_all.deb data.tar.gz | zcat > /tmp/18.tar
> $ ls -s /tmp/17.tar /tmp/18.tar
> 53532 /tmp/17.tar  53580 /tmp/18.tar
>
> $ time bsdiff /tmp/17.tar /tmp/18.tar /tmp/tar.bsdiff
>
> times:
>   real 2m4.994s  user 2m3.947s
> memory:
>   PID  USER    PR  NI  VIRT  RES   SHR  S %CPU %MEM  TIME+    COMMAND
>   9784 debdev  25   0  471m  470m  1384 T  0.0 46.5  1:18.82  bsdiff
> size:
>   92 /tmp/tar.bsdiff

I guess this is 92 kByte?

> so as you see, the reduction in size is impressive,
> but it uses too much memory and takes too much time.

I don't know whether this is in fact a typical example in terms of memory
consumption, because of:

  tetex-base (3.0-18) unstable; urgency=low
  [...]
  * Move the documentation from /usr/share/doc/texmf/ to
    /usr/share/doc/tetex-doc and let the symlink point to the new
    location, in accordance with new policy, and to allow parallel
    installation of some texlive packages.

So nearly every file that existed in 3.0-17 is at a new location in
3.0-18. It's impressive that bsdiff is able to notice that and reduce the
diff to such a small size.

The size is really small, especially because of:

  * Add a PDF documentation file for pst-poly which is only present as
    LaTeX source [frank]

and

  ls -l /usr/share/texmf-tetex/doc//generic/pstricks/pst-poly.pdf.gz
  -rw-r--r-- 1 root root 115290 2004-11-21 07:51 /usr/share/texmf-tetex/doc//generic/pstricks/pst-poly.pdf.gz

Regards, Frank
--
Frank Küster
Single Molecule Spectroscopy, Protein Folding @ Inst. f. Biochemie, Univ. Zürich
Debian Developer (teTeX)
Re: proposal for a more efficient download process
hi

by quite a coincidence, while you people were discussing this idea, I was
already implementing it, in a package called 'debdelta'; see
http://lists.debian.org/debian-devel/2006/05/msg03120.html

Moreover, by some telepathy :-) I already included features you were
proposing, and addressed problems you were discussing (and other problems
you were not discussing, since you did not try implementing it :-)

Here are the replies.

To curt manucredo:

While the implementation is not exactly what you were suggesting in your
original email, it still achieves all desired goals; moreover, it is
alive and kicking. 'debdelta' differs from your proposal in these
respects:
- it does not use dpkg-repack (for many good reasons, see below)
- it recreates the new .deb, and guarantees that it is equal to the one
  in the archives, so archive signatures can be verified; currently it
  does not patch into the filesystem (although this would be an easy
  adaptation, if anybody wishes for it)

'debdelta' conforms to your requests, in that it can recreate the new
.deb using either the installed version of the old package or the old
.deb file.

On the bright side, everything is already working: there is already a
repository of deltas available, and a method of downloading them.

To Tyler MacDonald:

- 'debdelta' uses 'bsdiff', or 'xdelta' as a fallback, see below
- regarding this:

> Some work will have to go into the math to determine when it's
> actually more efficient to download the latest archive, etc just a
> fleeting mental note, the threshold should not be 100% of the full archives
> size, it should be 90 or 80% due to the CPU/RAM overhead of patching and the
> bandwidth/latency overhead of requesting multiple patch files vs. one
> stream of data.

This math must go on the client side, and it is in my TODO list (see at
the end of the README); it is a nice exercise in dynamic programming.
Anyway, currently the archive discards deltas that exceed ~50% of the new
.deb, just as a heuristic, and to keep disk usage low.

To Goswin von Brederlow:

>| bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1)

Ah, so this is the correct formula! The man page just says '17*n'. But in
my stats that is not the case; my stats estimate that the memory is
'12*n', so that is what I use.

>| bytes of memory, where n is the size of the old file and m is the
>| size of the new file. bspatch requires n+m+O(1) bytes.
> That is quite unacceptable. We have debs in debian up to 160Mb

'debdelta' has an option '-M' to choose between 'xdelta' and 'bsdiff';
by default, it uses 'xdelta' when memory usage would exceed 50Mb; but on
the server, I set '-M 200', since I have 1GB RAM there.

> Seems to be quite useless for patching full debs. One would have to
> limit it to a file-by-file approach.

This is in my TODO list. Actually, I have in mind a scheme to break TARs
at suitable points; I have to check whether it is worthwhile; I can
discuss details.

To Tyler MacDonald again:

> True.. It'd probably only be efficient if the deltas were based on
> the contents of the .deb's before they're packed.

... and this is the reason why I do not use dpkg-repack... why unpack
data when I need it unpacked? :-)

Absolutely true. Look at this

$ ls -s tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb
42388 tetex-doc_3.0-18_all.deb  42340 tetex-doc_3.0-17_all.deb

$ bsdiff tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb brutal.bsdiff
$ ls -s brutal.bsdiff
10092 brutal.bsdiff

Hat tip to 'bsdiff', but we can do better...
$ ar p tetex-doc_3.0-17_all.deb data.tar.gz | zcat > /tmp/17.tar
$ ar p tetex-doc_3.0-18_all.deb data.tar.gz | zcat > /tmp/18.tar
$ ls -s /tmp/17.tar /tmp/18.tar
53532 /tmp/17.tar  53580 /tmp/18.tar

$ time bsdiff /tmp/17.tar /tmp/18.tar /tmp/tar.bsdiff

times:
  real 2m4.994s  user 2m3.947s
memory:
  PID  USER    PR  NI  VIRT  RES   SHR  S %CPU %MEM  TIME+    COMMAND
  9784 debdev  25   0  471m  470m  1384 T  0.0 46.5  1:18.82  bsdiff
size:
  92 /tmp/tar.bsdiff

so as you see, the reduction in size is impressive,
but it uses too much memory and takes too much time.

$ time xdelta delta -m 50M -9 /tmp/17.tar /tmp/18.tar /tmp/tar.xdelta

times:
  real 0m1.728s  user 0m1.660s
memory:
  ... it is too fast
size:
  236 /tmp/tar.xdelta

still good enough for our goal. Comparing to the above:

$ ls -s pool/main/t/tetex-base/tetex-doc_3.0-17_3.0-18_all.debdelta
288 pool/main/t/tetex-base/tetex-doc_3.0-17_3.0-18_all.debdelta

(the extra 35kB are the script that 'debpatch' uses :-( actually, I told
'debdelta' to use 'bzip' instead of gzip in this case, but it did not...
just found another bug :-) )

To Marc 'HE' Brockschmidt <[EMAIL PROTECTED]>:

> Now the interesting questions: How many diffs do you keep?

Very few, currently, due to space constraints; moreover, suppose that you
have a_1.deb installed, and a_1_2.debdelta and a_2_3.debdelta are in the
pool of deltas, wan
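The memory-based choice between bsdiff and xdelta that the email describes for debdelta's '-M' option can be sketched like this (a reconstruction from the email's description, not debdelta source code; the function names are invented for the sketch):

```python
def bsdiff_memory(old_size, new_size):
    """bsdiff's documented requirement: max(17*n, 9*n+m) bytes,
    where n is the old file size and m the new file size."""
    return max(17 * old_size, 9 * old_size + new_size)

def choose_tool(old_size, new_size, limit_mb=50):
    """Pick the differ roughly the way the email describes the '-M'
    option working: bsdiff when its memory need fits the budget
    (50 MB by default), xdelta otherwise."""
    if bsdiff_memory(old_size, new_size) <= limit_mb * 2**20:
        return "bsdiff"
    return "xdelta"
```

On the server from the email, `limit_mb` would be raised to 200 ('-M 200'), letting bsdiff handle larger packages at the cost of more RAM.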
Re: proposal for a more efficient download process
"curt manucredo (hansycm)" <[EMAIL PROTECTED]> writes:

> Marc 'HE' Brockschmidt <[EMAIL PROTECTED]> wrote:
>> Nope. You will need to keep all normal debs anyway, for new
>> installations.
>
> i thought it could be possible in the end to download the tree-package
> and all its patches to then have the latest package for a new install!
> so i thought there will be no more need for a lot of full packages. is
> it not? one of the advantages could be that you have more versions
> available than just the latest - this would be great for sid!

But stupid for stable and, since testing is the testbed for the next
stable, for testing also. You need a full deb there to build proper CDs
and DVDs. And since you don't know beforehand which version will make it
into stable, you have to save the full deb of every version.

MfG
        Goswin
re: proposal for a more efficient download process
Marc 'HE' Brockschmidt <[EMAIL PROTECTED]> wrote:

> Anyway, this was proposed some times now. Have you actually read the
> old threads and can explain why your proposal is better and actually
> works? Why haven't you implemented it yet?

not right now. i just found out that there were similar discussions about
it just a few days ago. sorry. i never claimed it to be my idea, i just
said it is a proposal. since i am new on debian-devel i will probably
have to find out even more, so please give me a chance to do so! and i
never said my proposal will work, either. well, i just thought i had come
up with a new idea. how stupid! :-)

--
greetings from austria

well, though i think i can't fix that problem,
i believe i can make a workaround!

* curt manucredo [EMAIL PROTECTED]

"Only two things are infinite, the universe and human stupidity,
and I'm not sure about the former." -- Albert Einstein
re: proposal for a more efficient download process
Marc 'HE' Brockschmidt <[EMAIL PROTECTED]> wrote:

> Nope. You will need to keep all normal debs anyway, for new
> installations.

i thought it could be possible in the end to download the tree-package
and all its patches to then have the latest package for a new install!
so i thought there will be no more need for a lot of full packages. is
it not? one of the advantages could be that you have more versions
available than just the latest - this would be great for sid!

> Now the interesting questions: How many diffs do you keep?

i thought of keeping the tree-package and its patches as long as it makes
sense. for example, if there is a next-version package and the patches
would grow too big, a new tree-package will come up. well, yes, it is
difficult to think this through, but anyway!

> How do you integrate this approach with the minimal security Release
> files give us today? What about the kind of signatures dpkg-sig
> provides?

sure. this proposal would require a lot of changes, not just a few. but
as i have suggested not .deb-oriented but file-oriented patching, the new
package will be created on the user's system with the downloaded
patch(es). so in the end, there will be a .deb package in the cache and
it will just install as always. if you make a package-mirror update to
look for updates, it will just show there is a new package. the user will
not even notice that only the patches were downloaded. hope that answers
your question. i am not quite sure, i am new! so please try to ask in
another way if this does not satisfy you! thanks :-)

--
greetings from austria

well, though i think i can't fix that problem,
i believe i can make a workaround!

* curt manucredo [EMAIL PROTECTED]

"Only two things are infinite, the universe and human stupidity,
and I'm not sure about the former." -- Albert Einstein
Re: proposal for a more efficient download process
Tyler MacDonald <[EMAIL PROTECTED]> writes:

> Goswin von Brederlow <[EMAIL PROTECTED]> wrote:
>> That is quite unacceptable. We have debs in debian up to 160Mb
>> (packed) and 580Mb unpacked. That would require 2.7 Gb and nearly 10Gb
>> ram respectively.
>>
>> Seems to be quite useless for patching full debs. One would have to
>> limit it to a file-by-file approach.
>
> True.. It'd probably only be efficient if the deltas were based on
> the contents of the .deb's before they're packed.

That is pretty much a given anyway, imho.

MfG
        Goswin
Re: proposal for a more efficient download process
"curt manucredo (hansycm)" <[EMAIL PROTECTED]> writes:

> II.B. on the upload and storage side
>
> the upload process may need some more changes though (e.g. for
> automation). if this ever comes true, there will have to be a period of
> time where both the old way and this way have to work, of course.

Nope. You will need to keep all normal debs anyway, for new
installations.

Now the interesting questions: How many diffs do you keep? How do you
integrate this approach with the minimal security Release files give us
today? What about the kind of signatures dpkg-sig provides?

Anyway, this has been proposed several times now. Have you actually read
the old threads, and can you explain why your proposal is better and
actually works? Why haven't you implemented it yet?

Marc
--
Technical terms of computing, simply explained (176: NT consultant):
Italian leather loafers, armpit sweat. Explains problems by claiming you
did not take the right courses in Unterschleissheim, which immediately
takes its revenge. (Anders Henke)
Re: proposal for a more efficient download process
Goswin von Brederlow <[EMAIL PROTECTED]> wrote:

> That is quite unacceptable. We have debs in debian up to 160Mb
> (packed) and 580Mb unpacked. That would require 2.7 Gb and nearly 10Gb
> ram respectively.
>
> Seems to be quite useless for patching full debs. One would have to
> limit it to a file-by-file approach.

True.. It'd probably only be efficient if the deltas were based on
the contents of the .deb's before they're packed.
Re: proposal for a more efficient download process
Tyler MacDonald <[EMAIL PROTECTED]> writes:

> +1. We've been using bsdiff (http://www.daemonology.net/bsdiff/) at
> work for some internal stuff and it's great.

Oh, and one more thing:

| bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1)
| bytes of memory, where n is the size of the old file and m is the
| size of the new file. bspatch requires n+m+O(1) bytes.

That is quite unacceptable. We have debs in debian up to 160Mb
(packed) and 580Mb unpacked. That would require 2.7 Gb and nearly 10Gb
of ram, respectively.

Seems to be quite useless for patching full debs. One would have to
limit it to a file-by-file approach.

MfG
        Goswin
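Goswin's 2.7 GB and ~10 GB figures follow directly from the quoted formula; a quick arithmetic check (sizes in megabytes, assuming old and new versions are roughly the same size):

```python
def bsdiff_mem_mb(n_mb, m_mb):
    # max(17*n, 9*n+m), straight from the quoted bsdiff documentation;
    # n = old file size, m = new file size, here in megabytes
    return max(17 * n_mb, 9 * n_mb + m_mb)

packed = bsdiff_mem_mb(160, 160)     # 160 MB packed .deb  -> 2720 MB (~2.7 GB)
unpacked = bsdiff_mem_mb(580, 580)   # 580 MB unpacked     -> 9860 MB (~9.6 GB)
```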
Re: proposal for a more efficient download process
Tyler MacDonald <[EMAIL PROTECTED]> writes:

> http://www.daemonology.net/bsdiff/

How does that compare with rsync batch files?

MfG
        Goswin
Re: proposal for a more efficient download process
> I. the reason why i suggest a patch-oriented download process

+1. We've been using bsdiff (http://www.daemonology.net/bsdiff/) at work
for some internal stuff and it's great.

Furthermore, since unstable has gone to using diffs for the Packages
files, my dselect updates have been *way* faster. Having the actual
downloads go faster as well would be awesome.

Some work will have to go into the math to determine when it's actually
more efficient to download the latest archive, etc. Just a fleeting
mental note: the threshold should not be 100% of the full archive's
size, it should be 90 or 80%, due to the CPU/RAM overhead of patching
and the bandwidth/latency overhead of requesting multiple patch files
vs. one stream of data.

Cheers,
        Tyler
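Tyler's point about patching overhead vs. one stream of data can be made concrete with a toy cost model (entirely illustrative; the bandwidth, latency, and patch-time defaults below are made-up numbers, not measurements):

```python
def delta_route_time(delta_sizes, bandwidth, latency, patch_time):
    """Rough time to fetch and apply a chain of deltas: transfer time
    plus one request round-trip and one patch pass per delta."""
    transfer = sum(delta_sizes) / bandwidth
    return transfer + len(delta_sizes) * (latency + patch_time)

def full_route_time(full_size, bandwidth, latency):
    """Time to fetch the full package in one stream."""
    return full_size / bandwidth + latency

def prefer_deltas(delta_sizes, full_size,
                  bandwidth=128 * 1024,   # bytes/s, made-up default
                  latency=0.3,            # seconds per request, made-up
                  patch_time=2.0):        # seconds per patch, made-up
    """True when the delta chain is estimated to be faster overall."""
    return (delta_route_time(delta_sizes, bandwidth, latency, patch_time)
            < full_route_time(full_size, bandwidth, latency))
```

This is why a fixed 80-90% size threshold is only a proxy: on a fast link the per-delta overhead dominates and the full download can win even when the deltas are much smaller.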
proposal for a more efficient download process
Dear Debian-Developers All Over The World!

may i introduce my proposal for a more efficient download process:

I. the reason why i suggest a patch-oriented download process
II. a brief description
  II.A. on the user's side
  II.B. on the upload and storage side

I. the reason why i suggest a patch-oriented download process
-------------------------------------------------------------

downloading a huge deb package can sometimes be painful, especially for
people who only have access to a slow internet connection; painful e.g.
when security fixes are made to the open-office packages. this leads to
what i call an extra copy with just some kb of changes. it is also
painful for those who have to download from sid to have the latest state
of development.

this is not a criticism of apt or dpkg! no, apt and dpkg are among the
reasons why i use debian. but i think the lack of an efficient download
process can be fixed. i believe this idea is not new, is already included
in other distributions, and is also on the minds of many debian
developers and users (e.g. me):

II. a brief description
-----------------------

please let me explain what is on my mind. it may or may not be a good
idea. i don't claim to be a professional, but i want to share my
thoughts. thanks!

II.A. on the user's side
------------------------

apt and probably dpkg need some changes, of course, but i believe these
changes aren't that big. so how to patch a package when there is no
local copy of an old one? there *is* a local copy of the old one: the
installed one! so there is a way to reproduce the old package to its
almost-original state, minding the conffiles which get manipulated
through the install process. so i suggest not deb-package-oriented
patching but file-oriented patching. conffiles should just get replaced
with the original or new version; the other files can mostly be patched.
the deb package's internal md5sums can then be used to verify the
originality of the new package. please have a look at 'dpkg-repack' by
joeyh. and after patching, the package can be foisted on dpkg. so i
think dpkg needs no hacks.
apt has to take care of the efficient download and patching process. this
can lead to this:

if a patch is available:

1. look in /var/cache/apt/packages for the package to be updated. if the
   old one is there, patch its files. md5sum. happy? if not...

2. try to repack the package with dpkg-repack. patch the files. md5sum.
   if no success...

3. download the whole package. not happy, but well. or download the
   current tree-package and apply all patches.

II.B. on the upload and storage side
------------------------------------

the upload process may need some more changes though (e.g. for
automation). if this ever comes true, there will have to be a period of
time where both the old way and this way have to work, of course. this
will require more space to store the packages and patches; i am sure
about this! then there is also the question of how to make the patches
available. i believe things can be left as they are, letting apt resolve
the download of patches. in the end, obviously there will only be
meta-packages representing the original and new package. so things on
the user's side can be left as they are; the user will only experience a
faster download.

proposal end

--
greetings from austria

* curt manucredo <[EMAIL PROTECTED]>
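The three-step fallback in the proposal can be sketched as a small driver (illustrative Python only; the actual patch, repack, and download steps are passed in as callables because no such tooling exists in the proposal, and the names are invented here):

```python
def obtain_new_deb(have_cached_deb, patch_cached, repack_and_patch, download_full):
    """The proposal's fallback chain. Each step is a callable that
    returns the path of the rebuilt .deb on success, or None when it
    fails (e.g. the md5sum check does not match)."""
    if have_cached_deb:
        result = patch_cached()        # 1. patch the cached old .deb
        if result is not None:
            return result
    result = repack_and_patch()        # 2. dpkg-repack the installed files, then patch
    if result is not None:
        return result
    return download_full()             # 3. give up and fetch the whole package
```

The md5sum check after each step is what keeps the chain safe: a failed verification simply drops through to the next, more expensive, way of obtaining the package.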