Wiki - https://fedoraproject.org/wiki/Changes/Hardlink_identical_files_in_packages_by_default
Discussion thread - https://discussion.fedoraproject.org/t/f43-change-proposal-hardlink-identical-files-in-packages-by-default-self-contained/160769

This is a proposed Change for Fedora Linux. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.

== Summary ==
A post-build step is added to the package build macros to automatically hardlink all identical files under `/usr`. Previously, this was done only in some packages; now it is done everywhere by default.

== Owner ==
* Name: [[User:zbyszek|Zbigniew Jędrzejewski-Szmek]]
* Email: zbyszek at in.waw.pl

== Detailed Description ==
Files can be hardlinked at the end of the `%install` step in package builds. rpm supports this and will preserve those links in the binary rpm and during installation. This makes the installation a bit more efficient.

Hardlinking of read-only files is generally transparent to the user, but has some small benefits: the files are not duplicated in the file system, and backup, copy, and search programs will usually make use of the link information and not process the same inode twice. Thus, it's good to hardlink as many packaged files as possible.

Previously, hardlinking was done automatically for a subset of files in Python packages (via the `%__os_install_post_python` macro), and explicitly in some packages with lots of similar files (usually via the `hardlink` program).

The `%__os_install_post` macro is extended to automatically hardlink all identical files under `%{buildroot}%{_prefix}`, i.e. the `/usr` directory in packages. This calls a new helper binary (part of the `add-determinism` package) that does the linking.

Hard links may be confusing if the file is ''modified''. In particular, all links to the same inode share the same ownership and permissions, and obviously the same contents. Thus, we want to apply hardlinking only to files under `/usr`, which are generally read-only in packages.
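
To make the shared-inode behaviour concrete, here is a minimal Python sketch. It is not part of the proposal: the function name `demo_shared_inode` and the file names are made up for illustration, and it assumes a file system that supports hard links.

```python
import os
import tempfile


def demo_shared_inode():
    """Create two hard links to one file and show that they share metadata."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "a.txt")
        second = os.path.join(tmp, "b.txt")
        with open(first, "w") as f:
            f.write("same contents\n")
        os.link(first, second)  # a second name for the same inode

        same_inode = os.stat(first).st_ino == os.stat(second).st_ino
        link_count = os.stat(first).st_nlink

        # Changing the mode through one name changes it for the other too.
        os.chmod(first, 0o600)
        mode_via_second = os.stat(second).st_mode & 0o777

        # Writing in place through one name is visible through the other.
        with open(second, "a") as f:
            f.write("appended\n")
        with open(first) as f:
            contents_via_first = f.read()

        return same_inode, link_count, mode_via_second, contents_via_first
```

An in-place edit or a `chmod` through either name affects both, which is why the linking is restricted to files that are not expected to change after installation.
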
When files are hardlinked, mtime (the modification timestamp) is taken into account. Only files with identical mtime, owner, group, and mode are subject to linking. The new program written to do the linking takes `$SOURCE_DATE_EPOCH` into account, and will clamp mtimes to it before comparing.

Note: rpm correctly handles the case where a hardlink is between files in two different subpackages. Thus, we can hardlink everything under `%{buildroot}`, and rpm will store the files as hardlinked if they are in the same output package, adjusting the hardlink counts as appropriate.

== Feedback ==

== Benefit to Fedora ==
As described above, hardlinking deduplicates the data in rpms and in installations. Backup, copy, and search programs will usually make use of the link information and not process the same inode twice. Thus, by hardlinking files in the packages we make things a bit more efficient. (The impact is small, because rpms generally don't have large duplicated files.)

Hardlinking of files was previously done in some packages explicitly, but it required adding a `BuildRequires` line and invoking a script, so it wasn't done very often. By handling this automatically, we'll be able to simplify those packages.

Another caveat that needs to be taken into account when doing hardlinking as part of the package build is that newer `hardlink` versions use reflinks instead of hardlinks by default.
(With a hardlink, one inode is connected to the file system tree in two or more places. With a reflink, some blocks of an inode are shared with another inode, ''inside'' of the file system, and the two inodes retain their separate identities.) rpm has no knowledge of reflinks, so those reflinks created during package build have no effect on the binary package and the payload is duplicated. Invocations of `hardlink` would have to be annotated with `--reflink=never` to retain the intended effect. By removing that step from packages we avoid this issue.

The [https://docs.fedoraproject.org/en-US/reproducible-builds/ Reproducible Builds] effort reported that some packages that use hardlinking are not reproducible, see [https://pagure.io/fedora-reproducible-builds/project/issue/22 irreproducibility#22]. When files are created in the package build, depending on how fast the build machine is, some files might or might not have identical timestamps. The tools that were used to compare files for hardlinking were general tools that did not "know" that we'd clamp the mtimes to `$SOURCE_DATE_EPOCH` in a subsequent step, so the results of the mtime comparisons were unstable. The tool that is added as part of this Change does the mtime clamping internally for reproducible results. Fixing this issue was the initial motivation for this change.

== Scope ==
* Proposal owners:
** extend the `add-determinism` package with a little helper that does file comparisons and hardlinks identical files. The helper takes `$SOURCE_DATE_EPOCH` into account.
** open pull request for `redhat-rpm-config` to insert a call to the helper in `%__os_install_post`.
** open pull request for `python-srpm-macros` to drop their hardlinking step.
* Other developers:
** merge pull request
** report issues if the hardlinking has unforeseen consequences or does not work correctly.
** drop explicit calls to `hardlink` in their packages.
* Release engineering:
* Policies and guidelines: not needed, AFAICT.
* Trademark approval: N/A (not needed for this Change)
* Alignment with the Fedora Strategy:

== Upgrade/compatibility impact ==
No impact.

== Early Testing (Optional) ==
Build a package with an invocation of the new helper.

== How To Test ==
Install packages rebuilt with the helper.

== User Experience ==
Not visible to users.

== Dependencies ==

== Contingency Plan ==
* Contingency mechanism:
** if hardlinking causes a problem in some specific packages, they can be trivially modified to skip the hardlinking step by setting a macro.
** if there is a general problem, we can easily drop the macro in `redhat-rpm-config`.
* Contingency deadline: any time, even after release. Any affected packages would have to be rebuilt.
* Blocks release? No.

== Documentation ==
The invocation of the helper will be documented inline in the macros files. Other documentation is not needed.

== Release Notes ==
Package builds automatically hardlink identical files. This reduces the installation footprint a bit and also makes package builds more reproducible.
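
For readers who want to see the comparison rule from the Detailed Description in action (identical contents, owner, group, mode, and mtime after clamping to `$SOURCE_DATE_EPOCH`), the following Python sketch may help. It is only an illustration under stated assumptions, not the actual helper from `add-determinism`: the function name `hardlink_identical`, its parameters, and the temporary-file suffix are hypothetical, and the sketch leaves the on-disk mtimes untouched, assuming a later build step clamps them.

```python
import filecmp
import os
import stat
from collections import defaultdict


def hardlink_identical(root, source_date_epoch):
    """Hardlink identical regular files under root; return the number of links made.

    Hypothetical illustration only, not the add-determinism helper.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and special files
            # Clamp the mtime before comparing, so that two files that are
            # both newer than SOURCE_DATE_EPOCH compare as equal.
            mtime = min(int(st.st_mtime), source_date_epoch)
            key = (st.st_size, mtime, st.st_uid, st.st_gid,
                   stat.S_IMODE(st.st_mode))
            groups[key].append(path)

    linked = 0
    for paths in groups.values():
        paths.sort()  # deterministic choice of the file that is kept
        originals = []  # one representative per distinct content
        for path in paths:
            for original in originals:
                if os.path.samefile(original, path):
                    break  # already linked, nothing to do
                if filecmp.cmp(original, path, shallow=False):
                    # Link under a temporary name, then rename over the
                    # duplicate so the path never disappears.
                    tmp = path + ".hardlink-tmp"
                    os.link(original, tmp)
                    os.replace(tmp, path)
                    linked += 1
                    break
            else:
                originals.append(path)
    return linked
```

Because the key uses the clamped mtime, the outcome does not depend on whether the build machine happened to write two identical files within the same second, which is the instability described under Benefit to Fedora.
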

--
Aoife Moloney
Fedora Operations Architect
Fedora Project
Matrix: @amoloney:fedora.im
IRC: amoloney
--
_______________________________________________
devel-announce mailing list -- devel-annou...@lists.fedoraproject.org
To unsubscribe send an email to devel-announce-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel-annou...@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue