Hi!

On Wed, 2009-09-30 at 19:19:01 -0500, Jonathan Nieder wrote:
> Guillem Jover wrote:
> > I guess a better question is, how much benefit a bigger dictionary
> > size would give us?
> Good question. Fedora people have been recently considering a similar
> question (they’re focused on speed rather than memory usage, but still
> it comes down to dictionary size versus compression ratio).
>
> From
> <http://thread.gmane.org/gmane.linux.redhat.fedora.devel/121067/focus=121116>
> we can conclude that once the dictionary is larger than the payload it
> doesn’t win us much. ;)
>
> From <http://www.advogato.org/person/badger/diary/80.html> we can
> conclude that with a reasonably sized and somewhat formulaic text file
> (an SQL database dump), preset -3 is good enough. That’s a dictionary
> size of 1 MiB.
>
> For deciding on limits, it would probably be good to experiment with
> actual “worst case” Debian packages (maybe openoffice.org).

Yeah. I was checking a bit, and found lzip, and its companion lzlib,
which seems to have a pretty straightforward API (both packaged in
Debian):

  <http://www.nongnu.org/lzip/lzip.html>
  <http://www.nongnu.org/lzip/lzlib.html>

And this thread with some comparisons (although against an old lzip
version) and a link to a blog post with interesting points in favor of
it instead of xz:

  <http://lists.gnu.org/archive/html/lzip-bug/2009-10/msg00000.html>

I also found this about the xz endianness problem, but it seems to have
been fixed already upstream:

  <http://www.mail-archive.com/[email protected]/msg08013.html>

So it would be nice to consider it as well.

> > We can try to specify it, and codify it in the tools, but there's
> > people out there building packages with ar and tar...
>
> Yes, dpkg should not break this way of working.

Well, creation of packages that way should not be encouraged either.

> > > Related question: If an LZMA-based file format might ever be used
> > > for udebs, what are the memory constraints for unpacking those?
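By the way, the dictionary-size-versus-payload point above is easy to
probe. As a rough sketch (this just uses Python's lzma bindings for
illustration; the payload and dictionary sizes are made up, not measured
on real packages):

```python
# Sketch: compress the same formulaic payload with different LZMA2
# dictionary sizes and compare output sizes. Once the dictionary
# exceeds the payload size, growing it further should not help much.
import lzma

# A repetitive, SQL-dump-like payload (illustrative, ~110 KiB).
payload = b"INSERT INTO t VALUES (1, 'somewhat formulaic text');\n" * 2000

for dict_size in (1 << 16, 1 << 20, 1 << 24):  # 64 KiB, 1 MiB, 16 MiB
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    out = lzma.compress(payload, format=lzma.FORMAT_XZ, filters=filters)
    print(dict_size, len(out))
```

Real numbers would of course have to come from actual “worst case”
.debs, as Jonathan says.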
> >
> > Well this is outside the scope of dpkg itself, and more a
> > project-wide decision, but I'm not sure we'd want any package in the
> > base system built with anything but gzip, as that's shared by
> > derivatives, embedded distros, etc. xz should probably be used for
> > big packages that are guaranteed to be used on desktops or huge
> > boxes (think games, openoffice.org, etc).
>
> Makes sense. xz was developed for an embedded distro, and its memory
> usage can be kept under control by using a small dictionary size, but
> we probably don’t want to slow down the install too much just for the
> sake of smaller packages.

Oh right, I realized that after having sent the mail and checked around
a bit. I guess the problem is that “embedded” is too wide a target: some
systems with low disk space and reasonable memory might truly benefit
from it, but ones with the inverse might not.

> One can indeed read the amount of memory from the file headers.
> Unfortunately, the maximum dictionary size is 4 GiB, and I would think
> using 4 GiB of memory to unpack a package, even if that’s available,
> would be bad behavior for dpkg. It is not obvious that examining the
> contents of an untrusted package should be considered an unsafe
> operation (on a server where this could lead to denial of service, for
> example).

If the package is untrusted then you had better not be installing it
anyway.

> > OTOH if the package is out of spec we can do whatever we want, but
> > I'd rather make dpkg cope with such packages gracefully.
>
> Agreed.

Just to be clear, what I meant was that if it's not going to be possible
at all to extract it anyway, it should abort up-front in a controlled
way, and not just get an ENOMEM in the middle of the unpacking.

regards,
guillem
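P.S. The “abort up-front instead of ENOMEM mid-unpack” behavior I mean
is roughly the following; a sketch using Python's lzma bindings, where
the decoder is given a memory cap and refuses the stream before
extraction starts rather than running out of memory halfway through
(the cap values here are illustrative only):

```python
# Sketch: enforce a decoder memory limit so an over-sized dictionary
# is rejected in a controlled way, up-front, instead of failing with
# ENOMEM in the middle of unpacking.
import lzma

def try_unpack(blob, memlimit):
    """Return the decompressed data, or None if the stream would
    need more memory than the given cap allows."""
    dec = lzma.LZMADecompressor(memlimit=memlimit)
    try:
        return dec.decompress(blob)
    except lzma.LZMAError:
        return None  # controlled refusal, before any extraction

# Preset 9 uses a 64 MiB dictionary, so the decoder needs ~64 MiB.
data = lzma.compress(b"x" * 1000, preset=9)
print(try_unpack(data, 1 << 30) is not None)  # generous cap: accepted
print(try_unpack(data, 1 << 10) is None)      # 1 KiB cap: refused
```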

