Re: proposal for a more efficient download process

2006-06-01 Thread Eduard Bloch
* curt manucredo (hansycm) [Fri, May 26 2006, 07:53:58PM]:

> this can lead to this:
> --
> if a patch is available:
> 
> 1. look in /var/cache/apt/archives for the package to be updated. if the
> old one is there, patch its files. md5sum. happy? if not...
> 
> 2. try to repack the package with dpkg-repack. patch the files. md5sum.
> if no success...

Repacking may suck, as pointed out by others. Why not just modify dpkg?
I imagine a new kind of "package-diff" packages, containing diffs
instead of real files, e.g. for the last n versions of a package (or for
m versions in a certain time frame). This way files would be created
smoothly on-the-fly. Of course the special dpkg program would use
debsums first to check the integrity of the installed package contents
before trying to patch them. And of course it would only continue after
the contents have been copied and patched successfully in a separate
location.
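
A rough sketch of what that could look like with today's tools (the
package name, per-file delta name, and staging path are invented for
illustration; debsums and xdelta are the real programs):

# verify the installed contents first; debsums exits non-zero on mismatch
$ debsums foo && mkdir /tmp/foo.staged
# copy a file aside and patch the copy in a separate location
$ cp /usr/bin/foo /tmp/foo.staged/foo
$ xdelta patch usr_bin_foo.xdelta /tmp/foo.staged/foo /tmp/foo.staged/foo.new
# compare against the new package's md5sums before moving anything into place
$ md5sum /tmp/foo.staged/foo.new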

Eduard.





Re: proposal for a more efficient download process

2006-06-01 Thread Frank Küster
A Mennucc <[EMAIL PROTECTED]> wrote:

> Absolutely true. Look at this
>
> $ ls -s tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb
>  42388 tetex-doc_3.0-18_all.deb 42340 tetex-doc_3.0-17_all.deb
>
> $ bsdiff tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb brutal.bsdiff
> $ ls -s brutal.bsdiff
>  10092 brutal.bsdiff
>
> Hat tip to 'bsdiff', but we can do better...
>
> $ ar p tetex-doc_3.0-17_all.deb data.tar.gz | zcat >  /tmp/17.tar
> $ ar p tetex-doc_3.0-18_all.deb data.tar.gz | zcat >  /tmp/18.tar
> $ ls -s /tmp/17.tar /tmp/18.tar
>
> 53532 /tmp/17.tar  53580 /tmp/18.tar
>
> $ time bsdiff /tmp/17.tar /tmp/18.tar /tmp/tar.bsdiff
>
> times:
>  real 2m4.994s  user 2m3.947s
> memory:
>   PID USER    PR  NI  VIRT  RES   SHR S %CPU %MEM    TIME+  COMMAND
>  9784 debdev  25   0  471m  470m  1384 T  0.0 46.5  1:18.82 bsdiff
> size:
>   92 /tmp/tar.bsdiff

I guess this is 92 kByte?

> so as you see, the reduction in size is impressive, 
> but it uses too much memory  and takes too much time.

I don't know whether this is in fact a typical example in terms of
memory consumption, because of this changelog entry:

tetex-base (3.0-18) unstable; urgency=low

  [...]
  * Move the documentation from /usr/share/doc/texmf/ to
/usr/share/doc/tetex-doc and let the symlink point to the new
location, in accordance with new policy, and to allow parallel
installation of some texlive packages.

So nearly every file that existed in 3.0-17 is at a new location in
3.0-18.  It's impressive that bsdiff is able to notice that and reduce
the diff to such a small size.  The size is really small, especially
because of:

  * Add a PDF documentation file for pst-poly which is only present as
LaTeX source [frank]

and 

ls -l /usr/share/texmf-tetex/doc//generic/pstricks/pst-poly.pdf.gz 
-rw-r--r--  1 root root 115290 2004-11-21 07:51 
/usr/share/texmf-tetex/doc//generic/pstricks/pst-poly.pdf.gz

Regards, Frank
-- 
Frank Küster
Single Molecule Spectroscopy, Protein Folding @ Inst. f. Biochemie, Univ. Zürich
Debian Developer (teTeX)



Re: proposal for a more efficient download process

2006-06-01 Thread A Mennucc

hi

by quite a coincidence, while you people were discussing this idea, I
was already implementing it, in a package called 'debdelta'; see
 http://lists.debian.org/debian-devel/2006/05/msg03120.html

Moreover, by some telepathy :-) I had already included features you were
proposing, and addressed problems you were discussing (and other
problems you were not discussing, since you did not try implementing
it :-) )

Here are the replies:

To curt manucredo: while the implementation is not exactly what you
were suggesting in your original email, it still achieves all the
desired goals; moreover, it is alive and kicking.

'debdelta' differs from your implementation in this respect:
- it does not use dpkg-repack (for many good reasons, see below)
- it recreates the new .deb, and guarantees that it is equal to the
  one in the archives, so archive signatures can be verified;
  currently it does not patch files into the filesystem
  (although this would be an easy adaptation, if anybody wishes for it)

'debdelta' conforms to your requests, in that
- it can recreate the new .deb either using the installed version of
  the old .deb, or the old .deb file itself.

On the bright side, everything is already working: there is already
a repository of patches available, and a method for downloading them.
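
A minimal sketch with hypothetical package names (per the debdelta
docs, a '/' in place of the old .deb means "use the installed files"):

# rebuild the new .deb from the old .deb plus the delta:
$ debpatch foo_1-2_i386.debdelta foo_1_i386.deb foo_2_i386.deb

# or rebuild it from the installed version, when the old .deb is gone:
$ debpatch foo_1-2_i386.debdelta / foo_2_i386.deb

# the result is byte-identical, so the archive checksums still verify:
$ md5sum foo_2_i386.deb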

To Tyler MacDonald:
 - 'debdelta' uses 'bsdiff', or 'xdelta' as a fallback, see below
 - regarding this:
> Some work will have to go into the math to determine when it's
> actually more efficient to download the latest archive, etc. Just a
> fleeting mental note: the threshold should not be 100% of the full
> archive's size, it should be 90 or 80%, due to the CPU/RAM overhead of
> patching and the bandwidth/latency overhead of requesting multiple
> patch files vs. one stream of data.
This math must go on the client side, and it is on my TODO list (see
the end of the README); it is a nice exercise in dynamic programming.

Anyway, currently the archive discards deltas that exceed ~50% of the
new .deb, just as a heuristic, and to keep disk usage low.
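
A minimal sketch of that check, assuming hypothetical file names (the
50% cutoff is the heuristic just mentioned):

$ deb=foo_2_i386.deb; delta=foo_1-2_i386.debdelta
$ if [ $(stat -c%s "$delta") -lt $(( $(stat -c%s "$deb") / 2 )) ]
> then echo "keep $delta"
> else rm -f "$delta"   # not worth serving; clients fetch the full .deb
> fi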

To Goswin von Brederlow :
>| bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1)

Ah, so this is the correct formula! The man page just says '17*n'.

But in my stats that is not the case; my stats estimate that the
memory use is about '12*n', so that is what I use.

>| bytes of memory, where n is the size of the old file and m is the
>| size of the new file. bspatch requires n+m+O(1) bytes.
> That is quite unacceptable. We have debs in debian up to 160Mb

'debdelta' has an option '-M' to choose between 'xdelta' and 'bsdiff';
by default, it uses 'xdelta' when memory usage would exceed 50Mb;
but on the server I set '-M 200', since I have 1GB of RAM there.
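
For example (hypothetical package names, argument order as in the
debdelta manpage):

$ debdelta -M 200 foo_1_i386.deb foo_2_i386.deb foo_1-2_i386.debdelta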
 
> Seems to be quite useless for patching full debs. One would have to
> limit it to a file-by-file approach.

This is on my TODO list. Actually, I have in mind a scheme to
break TARs at suitable points; I have to check whether it is
worthwhile; I can discuss the details.
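
In the meantime the file-by-file idea can be approximated with standard
tools; a sketch, reusing the two tars unpacked above (paths made up):

# diff member by member, so bsdiff's memory use is bounded by the
# largest single file instead of the whole data.tar:
$ mkdir /tmp/old /tmp/new /tmp/deltas
$ tar -xf /tmp/17.tar -C /tmp/old
$ tar -xf /tmp/18.tar -C /tmp/new
$ (cd /tmp/new && find . -type f) | while read f
> do [ -f "/tmp/old/$f" ] &&
>      bsdiff "/tmp/old/$f" "/tmp/new/$f" "/tmp/deltas/$(echo "$f" | tr / _)"
> done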

To: Tyler MacDonald again:
>   True... It'd probably only be efficient if the deltas were based on
> the contents of the .debs before they're packed.

... and this is the reason why I do not use dpkg-repack: why unpack
the data when I need them unpacked? :-)

Absolutely true. Look at this

$ ls -s tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb
 42388 tetex-doc_3.0-18_all.deb 42340 tetex-doc_3.0-17_all.deb

$ bsdiff tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb brutal.bsdiff
$ ls -s brutal.bsdiff
 10092 brutal.bsdiff

Hat tip to 'bsdiff', but we can do better...

$ ar p tetex-doc_3.0-17_all.deb data.tar.gz | zcat >  /tmp/17.tar
$ ar p tetex-doc_3.0-18_all.deb data.tar.gz | zcat >  /tmp/18.tar
$ ls -s /tmp/17.tar /tmp/18.tar

53532 /tmp/17.tar  53580 /tmp/18.tar

$ time bsdiff /tmp/17.tar /tmp/18.tar /tmp/tar.bsdiff

times:
 real 2m4.994s  user 2m3.947s
memory:
  PID USER    PR  NI  VIRT  RES   SHR S %CPU %MEM    TIME+  COMMAND
 9784 debdev  25   0  471m  470m  1384 T  0.0 46.5  1:18.82 bsdiff
size:
  92 /tmp/tar.bsdiff

so as you see, the reduction in size is impressive, 
but it uses too much memory  and takes too much time.

$ time xdelta delta -m 50M -9  /tmp/17.tar /tmp/18.tar /tmp/tar.xdelta
times:
 real 0m1.728s  user 0m1.660s
memory... it is too fast to measure
size:
  236 /tmp/tar.xdelta

still good enough for our goal



Compared to the above:

$ ls -s pool/main/t/tetex-base/tetex-doc_3.0-17_3.0-18_all.debdelta

288 pool/main/t/tetex-base/tetex-doc_3.0-17_3.0-18_all.debdelta

(the extra 35kB are the script that 'debpatch' uses :-(
 actually, I told 'debdelta' to use 'bzip2' instead of gzip
 in this case, but it did not... just found another bug :-) )

To:  Marc 'HE' Brockschmidt <[EMAIL PROTECTED]>:
> Now the interesting questions: How many diffs do you keep?

very few, currently, due to space constraints; moreover, suppose that
you have a_1.deb installed, and a_1_2.debdelta and a_2_3.debdelta are
in the pool of deltas, wan

Re: proposal for a more efficient download process

2006-05-28 Thread Goswin von Brederlow
"curt manucredo (hansycm)" <[EMAIL PROTECTED]> writes:

> Marc 'HE' Brockschmidt <[EMAIL PROTECTED]> wrote:
>
>  >Nope. You will need to keep all normal debs anyway, for new
>  >installations.
>
> i thought it could be possible, in the end, to download the
> tree-package and all its patches to then have the latest package for
> a new install! so i thought there would be no more need for a lot of
> full packages. is that not so? one of the advantages could be that
> you have more versions available than just the latest - this would be
> great for sid!

But that would be stupid for stable and, since testing is the testbed
for the next stable, for testing too. You need the full debs there to
build proper CDs and DVDs. And since you don't know beforehand which
version will make it into stable, you would have to save the full deb
of every version.

MfG
Goswin





re: proposal for a more efficient download process

2006-05-28 Thread curt manucredo (hansycm)

Marc 'HE' Brockschmidt <[EMAIL PROTECTED]> wrote:

>Anyway, this has been proposed several times now. Have you actually
>read the old threads, and can you explain why your proposal is better
>and actually works? Why haven't you implemented it yet?

not right now. i just found out that there were some similar
discussions about it just some days ago. sorry. i never claimed it to
be my idea. i just said it is a proposal. since i am new on
debian-devel i will probably have to find out even more. so please give
me a chance to do so! and i never said my proposal will work, though.
well, i just thought i had come up with a new idea. how stupid! :-)


--
greetings from austria

well, i think i can't fix that problem, but i believe i can make a
workaround!

*
curt manucredo
[EMAIL PROTECTED]

"Only two things are infinite, the universe and human stupidity,
and I'm not sure about the former." -- Albert Einstein
--





re: proposal for a more efficient download process

2006-05-28 Thread curt manucredo (hansycm)

Marc 'HE' Brockschmidt <[EMAIL PROTECTED]> wrote:

>Nope. You will need to keep all normal debs anyway, for new
>installations.

i thought it could be possible, in the end, to download the
tree-package and all its patches to then have the latest package for
a new install! so i thought there would be no more need for a lot of
full packages. is that not so? one of the advantages could be that
you have more versions available than just the latest - this would be
great for sid!


>Now the interesting questions: How many diffs do you keep?

i thought of keeping the tree-package and its patches as long as it
makes sense. for example, if there is a next-version package and the
patches would grow too big, a new tree-package will come up. well,
yes, it is difficult to think this through, but anyway!


>How do you
>integrate this approach with the minimal security the Release files
>give us today? What about the kind of signatures dpkg-sig provides?

sure. this proposal would require a lot of changes, not just a few.
but as i have suggested not a .deb-oriented but a file-oriented
patching, the new package will be created on the user's system from
the downloaded patch(es). so in the end, there will be a .deb package
in the cache and it will just install as always. if you make a
package-mirror-update to look for updates, it will just show that
there is a new package.

the user will not even notice that it just downloads the patches.
hope that answers your question. i am not quite sure. i am new!
so please try to ask in another way if this does not satisfy you!
thanks :-)
thank's :-)
--
greetings from austria

well, i think i can't fix that problem, but i believe i can make a
workaround!

*
curt manucredo
[EMAIL PROTECTED]

"Only two things are infinite, the universe and human stupidity,
and I'm not sure about the former." -- Albert Einstein
--





Re: proposal for a more efficient download process

2006-05-27 Thread Goswin von Brederlow
Tyler MacDonald <[EMAIL PROTECTED]> writes:

> Goswin von Brederlow <[EMAIL PROTECTED]> wrote:
>> That is quite unacceptable. We have debs in debian up to 160Mb
>> (packed) and 580Mb unpacked. That would require 2.7 Gb and nearly 10Gb
>> ram respectively.
>> 
>> Seems to be quite useless for patching full debs. One would have to
>> limit it to a file-by-file approach.
>
>   True... It'd probably only be efficient if the deltas were based on
> the contents of the .debs before they're packed.

That is pretty much a given anyway imho.

MfG
Goswin





Re: proposal for a more efficient download process

2006-05-27 Thread Marc 'HE' Brockschmidt
"curt manucredo (hansycm)" <[EMAIL PROTECTED]> writes:
> II.B. on the upload and storage side
> ------------------------------------
>
> the upload process may need some more changes though (e.g. for
> automation). if this ever comes true, there will have to be a period of
> time where both the old way and this way have to work, of course.

Nope. You will need to keep all normal debs anyway, for new
installations.
Now the interesting questions: How many diffs do you keep? How do you
integrate this approach with the minimal security the Release files
give us today? What about the kind of signatures dpkg-sig provides?

Anyway, this has been proposed several times now. Have you actually
read the old threads, and can you explain why your proposal is better
and actually works? Why haven't you implemented it yet?

Marc
-- 
Technical terms of computer science - simply explained (176: NT consultant)
   Italian leather loafers, armpit sweat. Explains problems by claiming
   that you did not take the right courses in Unterschleißheim, and that
   this immediately takes its revenge. (Anders Henke)




Re: proposal for a more efficient download process

2006-05-27 Thread Tyler MacDonald
Goswin von Brederlow <[EMAIL PROTECTED]> wrote:
> That is quite unacceptable. We have debs in debian up to 160Mb
> (packed) and 580Mb unpacked. That would require 2.7 Gb and nearly 10Gb
> ram respectively.
> 
> Seems to be quite useless for patching full debs. One would have to
> limit it to a file-by-file approach.

True... It'd probably only be efficient if the deltas were based on
the contents of the .debs before they're packed.


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Re: proposal for a more efficient download process

2006-05-26 Thread Goswin von Brederlow
Tyler MacDonald <[EMAIL PROTECTED]> writes:

>   +1. We've been using bsdiff (http://www.daemonology.net/bsdiff/) at
> work for some internal stuff and it's great.

Oh, and one more thing:

| bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1)
| bytes of memory, where n is the size of the old file and m is the
| size of the new file. bspatch requires n+m+O(1) bytes.

That is quite unacceptable. We have debs in debian up to 160Mb
(packed) and 580Mb unpacked. That would require 2.7 Gb and nearly 10Gb
ram respectively.

Seems to be quite useless for patching full debs. One would have to
limit it to a file-by-file approach.

MfG
Goswin





Re: proposal for a more efficient download process

2006-05-26 Thread Goswin von Brederlow
Tyler MacDonald <[EMAIL PROTECTED]> writes:

> http://www.daemonology.net/bsdiff/


How does that compare with rsync batch files?
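
(For reference, a batch file records one transfer's differences so they
can be replayed against other copies of the same old data; the flags
are real rsync options, the file names made up:)

# record the changes needed to update the mirror copy, without applying them:
$ rsync --only-write-batch=foo.batch -a foo_2_i386.deb /mirror/
# later, replay the recorded changes on any host with the same old copy:
$ rsync --read-batch=foo.batch -a /mirror/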

MfG
Goswin





Re: proposal for a more efficient download process

2006-05-26 Thread Tyler MacDonald
> I. the reason why i suggest a patch-oriented download process

+1. We've been using bsdiff (http://www.daemonology.net/bsdiff/) at
work for some internal stuff and it's great. Furthermore, since unstable has
gone to using diffs for the Packages files, my dselect updates have been
*way* faster. Having the actual downloads go faster as well would be
awesome.

Some work will have to go into the math to determine when it's
actually more efficient to download the latest archive, etc. Just a
fleeting mental note: the threshold should not be 100% of the full
archive's size, it should be 90 or 80%, due to the CPU/RAM overhead of
patching and the bandwidth/latency overhead of requesting multiple
patch files vs. one stream of data.

Cheers,
Tyler





proposal for a more efficient download process

2006-05-26 Thread curt manucredo (hansycm)

Dear Debian-Developers All Over The World!

may i introduce my,

proposal for a more efficient download process


I. the reason why i suggest a patch-oriented download process
II. a brief description
II.A. on the users side
II.B. on the upload and storage side


I. the reason why i suggest a patch-oriented download process
--------------------------------------------------------------

downloading a huge deb-package can sometimes be painful, especially
when people only have access to a slow internet connection; painful
e.g. when security fixes are made to the open-office packages. so this
leads to what i call an extra copy with just some kb of changes. this
is also painful for those who have to download from sid to have the
latest state of development. this is not a criticism of apt or dpkg!
no. apt and dpkg are among the reasons why i use debian. but i think
the lack of an efficient download process can be fixed. i even believe
this idea is not new and is already included in other distributions,
and also on the mind of many debian developers and users (e.g. me):


II. a brief description
-----------------------

please let me explain what is on my mind. it may or may not be a good
idea. i don't claim to be a professional but want to share my
thoughts. thanks!



II.A. on the users side
-----------------------

apt and probably dpkg need some changes, of course. but i believe
these changes aren't that big. so how to patch a package when there is
no local copy of an old one? there is a local copy of the old one: the
installed one! so there is a way to reproduce the old package to its
almost original state, except for the conffiles, which get manipulated
through the install process. so i suggest not a deb-package-oriented
patching but a file-oriented one. conffiles should just get replaced
with the original or new version. the other files can mostly be
patched. the deb-package-internal md5sums can then be used to verify
the originality of the new package. please have a look at
'dpkg-repack' by joeyh. after patching, the package can be foisted on
dpkg. so i think dpkg needs no hacks. apt has to take care of the
efficient download and patching process.

this can lead to this:
----------------------
if a patch is available:

1. look in /var/cache/apt/archives for the package to be updated. if
the old one is there, patch its files. md5sum. happy? if not...

2. try to repack the package with dpkg-repack. patch the files.
md5sum. if no success...

3. download the whole package. not happy, but well. or download the
current tree-package and apply all patches. (a rough sketch of this
fallback order follows below.)
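
a rough sketch of that fallback order, where 'apply-patch' is a
made-up stand-in for whatever tool would do the per-file patching:

pkg=foo
if [ -f /var/cache/apt/archives/${pkg}_1_i386.deb ]; then
    apply-patch ${pkg}.pdiff /var/cache/apt/archives/${pkg}_1_i386.deb
elif dpkg-repack ${pkg}; then      # rebuild the old .deb from installed files
    apply-patch ${pkg}.pdiff ./${pkg}_1_i386.deb
else
    apt-get -d install ${pkg}      # give up: download the whole package
fi
# in every branch, verify the result against the new package's md5sums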


II.B. on the upload and storage side
------------------------------------

the upload process may need some more changes though (e.g. for
automation). if this ever comes true, there will have to be a period
of time where both the old way and this way have to work, of course.
this will lead to the fact that more space is required to store the
packages and patches; i am sure about this! then there is also the
question of how to make the patches available. i believe things can be
left as they are, and apt can resolve the download of patches. in the
end, obviously there will only be meta-packages representing the
original and new package. so things on the users side can be left as
they are. the user will only experience a faster download.


proposal end


--
greetings from austria

*
curt manucredo <[EMAIL PROTECTED]>

