Hi,

I'm reviving this old bug because it recently came up again in the context
of ReproducibleBuilds.

On Sat, 26 Nov 2011 12:06:42 +0100 Helmut Grohne <[email protected]> wrote:
> The actual problem
> ~~~~~~~~~~~~~~~~~~
> Problems with Installed-Size are not exactly new as discussion in
> http://bugs.debian.org/534408 (unit for Installed-Size) and
> http://bugs.debian.org/630533 (usage of du --apparent-size) have shown.
> So what is different this time? Installing the very same package on a
> btrfs yields a size that is much closer to the listed Installed-Size. (I
> don't have any numbers on this.) So whatever dpkg puts into this field,
> it *will* be wrong somewhere. The policy already mentions that this
> estimate cannot be accurate everywhere, but in fact it will be wrong by
> a factor of at least 2.5 (=sqrt(8)) or a difference of at least 50MB
> (=100MB/2) somewhere. Any attempt to change the computation of this
> value thus cannot fix this bug.
>
> Discussion
> ~~~~~~~~~~
> In the example of libjs-mathjax the reason for the huge difference is
> the inclusion of a large number of very small files. Some filesystems
> allocate a block for each of these files and others are able to store
> multiple files in a block. A simple approach could be to include an
> additional field ("Installed-Files"?) that returns the number of files
> in the package. A second estimate for the Installed-Size would then be
> given by the number of files times the block size. The maximum of both
> estimates could be used. It would solve the immediate symptoms with
> libjs-mathjax. It is not without problems though. For instance I
> did not explain what block size to use. An administrator may have
> different file systems set up for / and /usr. Also the question remains
> whether this feature is worth the associated effort.
>
> To get discussion going I pull in [email protected].

We did some brainstorming in #debian-reproducible over the past days. I'll
try to summarize the discussion and Helmut can chip in if I missed
something.

The fundamental problem is that the target file system on which the binary
package gets installed can influence, in many ways, how much space
installing the package requires. This includes, but is not limited to:

 - support for sparse files
 - inlining data inside the inode
 - compression
 - block-level or file-level deduplication

Additionally, disk usage can even grow when files are removed, due to:

 - snapshots
 - overlay file systems

Helmut argues that an additional field like Installed-Files can improve the
approximation for file systems that have different block sizes or that
differ in whether they can store multiple small files in a single block.
This solution could be extended by grouping files into size classes of
exponentially growing width (like: 4^(n-1) <= size < 4^n) and storing, for
each class, the number of files and the cumulative number of bytes occupied
by these files. But this still cannot account for sparse files, compression
or deduplication.

It is also worth asking what functionality the Installed-Size field is
supposed to have when looking for a solution. Its primary purpose is
probably to give apt a clue of whether or not there is enough free space to
install a certain package. Helmut notes that debian-reference,
popularity-contest, deborphan and cupt also use the Installed-Size field.

I would argue that the only way to reliably solve this problem is either:

 1) an over-approximation, i.e. a value that is larger than the actual
    file system usage on any common file system, or
 2) a way for apt or dpkg to ask the file system whether there is enough
    space to store a certain file/directory structure. Few file systems
    (if any) offer this, though.
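
To make the size-class idea above concrete, here is a minimal Python sketch. It is only an illustration, not something proposed for dpkg: the function names (`size_classes`, `upper_bound`) are made up, zero-byte files are put into a class 0 of their own (the interval notation above does not cover them), and the bound assumes a file system that gives every file its own whole blocks and ignores metadata.

```python
def size_class(size):
    """Return n such that 4**(n-1) <= size < 4**n (and 0 for empty files)."""
    return (size.bit_length() + 1) // 2


def size_classes(sizes):
    """Map class n -> (number of files, cumulative bytes) for the given sizes."""
    classes = {}
    for size in sizes:
        n = size_class(size)
        count, total = classes.get(n, (0, 0))
        classes[n] = (count + 1, total + size)
    return classes


def upper_bound(classes, block_size):
    """Upper bound in bytes on data-block usage, given only the class table.

    Assumption: every file occupies whole blocks of its own. Two bounds hold
    per class and the smaller one is used:
      * every file is at most 4**n - 1 bytes long, and
      * rounding a file up to whole blocks adds less than one block.
    """
    bound = 0
    for n, (count, total) in classes.items():
        largest = 4 ** n - 1
        blocks_per_file = -(-largest // block_size)  # ceiling division
        by_count = count * blocks_per_file * block_size
        by_bytes = total + count * block_size
        bound += min(by_count, by_bytes)
    return bound


if __name__ == "__main__":
    table = size_classes([1, 2, 3, 100, 5000])
    for block_size in (1024, 4096):
        print(block_size, upper_bound(table, block_size))
```

The point of the table is that the same few numbers can be evaluated for whatever block size the target file system uses. As said above, it still knows nothing about sparse files, compression or deduplication.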

I think that an over-approximation would be the right way to go, because it
is better to wrongly warn the user that a binary package might not be
installable due to insufficient remaining disk space than to start
installing a package without sufficient remaining disk space and only fail
once there actually is no more space.

The addition of the `--apparent-size` argument to the du call in dpkg, as a
response to bug #630533, made the value of the Installed-Size field too
small in some situations, as can be seen in this bug report. The bug report
in #630533 argues that --apparent-size should be used precisely because
there are file systems that can store many small files more efficiently.
Because of my argument in the previous paragraph, I'd argue the opposite.
The change was then applied, with guillem arguing that --apparent-size
should be used for consistency between package rebuilds. But the
--apparent-size argument is not sufficient to provide this consistency:
running `du -k -s --apparent-size` (the command currently used by dpkg) on
an unpacked mathjax source on an ext4 and on a btrfs file system reports
different values. This is detrimental to the goals of the
ReproducibleBuilds effort.

I thus propose that dpkg implement something with the following
functionality which (if I didn't forget to test something) will give an
over-approximation for Installed-Size and at the same time will be
reproducible across different file systems:

    ( find mathjax-2.4 -type f -print0 \
        | du --files0-from=- -b; \
      find mathjax-2.4 \! -type f -printf "1\n" ) \
      | awk '{total = total + (int($1/4096) + 1) * 4096}END{print total}'

I'm not proposing this code to be part of dpkg but I'm posting it because
code is precise but words are not. The same can probably easily be
implemented in perl. The above snippet gets the number of bytes of every
regular file and treats every non-regular file entry as being 1 byte in
size.
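
The same per-entry accounting can also be spelled out in Python. This is a rough sketch under stated assumptions, not a drop-in replacement for what dpkg does: the function names are invented here, every entry is charged int(size/4096) + 1 blocks of 4096 bytes, the result is in bytes (Installed-Size itself is expressed in KiB), and, unlike du, hard-linked files are counted once per name.

```python
import os
import stat

BLOCK_SIZE = 4096  # arbitrarily picked block size


def entry_sizes(root):
    """Yield one size per entry: st_size for regular files, 1 for the rest.

    Directories, symlinks, devices etc. are counted as 1 byte, so the size a
    file system happens to report for them does not enter the result.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths += [os.path.join(dirpath, name) for name in dirnames + filenames]
    for path in paths:
        st = os.lstat(path)  # lstat: do not follow symlinks
        yield st.st_size if stat.S_ISREG(st.st_mode) else 1


def installed_size_estimate(root, block_size=BLOCK_SIZE):
    """Sum of all entry sizes, each charged size // block_size + 1 blocks."""
    return sum((size // block_size + 1) * block_size
               for size in entry_sizes(root))


if __name__ == "__main__":
    import sys
    print(installed_size_estimate(sys.argv[1]))
```

Since only file lengths and the number of entries go into the sum, the result should not depend on the file system the tree was unpacked on.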
The awk call at the end of the pipeline then rounds all these values up to
multiples of an (arbitrarily picked) block size of 4096 bytes.

I think that finding the right solution to this problem requires defining
what the purpose of the Installed-Size field is. If it is to prevent
package installations on systems where there is not enough space, then I
think an over-approximation is the right way to go. More complicated
measures will still not be able to give a good approximation, given the
feature-richness of today's file systems.

What do you think?

Thanks!

cheers, josch

Archive: https://lists.debian.org/20150107112247.27069.67974@hoothoot

