commit: e4dc2627c8107339b13e20709125e2d9fc91ffde
Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
AuthorDate: Wed Feb 7 13:20:45 2018 +0000
Commit: Michał Górny <mgorny <AT> gentoo <DOT> org>
CommitDate: Wed Feb 7 13:22:08 2018 +0000
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=e4dc2627
glep-0075: Extend rationale for splitting algorithm
Extend and refactor the rationale for splitting algorithm. Explicitly
state the goals, list all the options that occurred during the ml
discussion.
glep-0075.rst | 116 +++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 91 insertions(+), 25 deletions(-)
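The trade-off between options (a) and (c) in the extended rationale can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the sample file names, the choice of BLAKE2b from Python's ``hashlib``, and the use of two leading hex digits. This hunk of the GLEP does not prescribe any of them.

```python
import hashlib
from collections import Counter


def prefix_bucket(filename: str, length: int) -> str:
    """Option (a): group by the first `length` characters of the filename."""
    return filename[:length]


def filename_hash_bucket(filename: str, hex_digits: int = 2) -> str:
    """Option (c): group by the leading hex digits of a hash of the filename.

    The hash function (BLAKE2b) and the number of digits are assumptions
    made for this sketch only.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_digits]


# Hypothetical distfile names that all share the ``texlive-module-`` prefix.
names = [f"texlive-module-pkg{i}-2017.tar.xz" for i in range(5000)]

# A fixed-length prefix as long as the shared one cannot separate them.
by_prefix = Counter(prefix_bucket(n, len("texlive-module-")) for n in names)

# Hashing the filename spreads the same names over up to 256 groups.
by_hash = Counter(filename_hash_bucket(n) for n in names)

print("prefix groups:", len(by_prefix), "largest:", max(by_prefix.values()))
print("hash groups:", len(by_hash), "largest:", max(by_hash.values()))
```

With a fixed-length prefix, every sample name lands in the same group, which is the TeΧ Live problem described in the rationale. The filename hash needs only the name itself, so the group can be computed before the file is downloaded.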
diff --git a/glep-0075.rst b/glep-0075.rst
index 157514e..00d14c3 100644
--- a/glep-0075.rst
+++ b/glep-0075.rst
@@ -187,43 +187,98 @@ Rationale
=========
Algorithm for splitting distfiles
---------------------------------
-In the original debate that occurred in bug #534528 [#BUG534528]_,
-three possible solutions for splitting distfiles were listed:
+The possible algorithms were considered with the following goals
+in mind:
-a. using initial portion of filename,
+- the number of files in a single directory should not exceed 1000,
-b. using initial portion of file hash,
+- the total size of files in a single directory is not considered
+ relevant,
-c. using initial portion of filename hash.
+- the solution should preferably be future-proof,
-The significant advantage of the filename option was simplicity. With
-that solution, the users could easily determine the correct subdirectory
-themselves. However, it's significant disadvantage was very uneven
-shuffling of data. In particular, the TeΧ Live packages alone count
-almost 23500 distfiles and all use a common prefix, making it impossible
-to split them further.
+- moving distfiles should be avoided once it is deployed.
-The alternate option of using file hash has the advantage of having
-a more balanced split. Furthermore, since hashes are stored
-in Manifests using them is zero-cost. However, this solution has three
-significant disadvantages:
+It should also be noted that the package with the most distfiles
+in Gentoo at the time of writing is dev-texlive/texlive-latexextra,
+with 8556 distfiles. All of them start with the common prefix
+``texlive-module-``. This specific prefix is used by a total
+of 23435 distfiles.
-1. The hash values are unknown for newly-downloaded distfiles, so
- ``repoman`` (or an equivalent tool) would have to use a temporary
- directory before locating the file in appropriate subdirectory.
+In the original debate that occurred in bug #534528 [#BUG534528]_
+and the mailing list review of the initial version of this GLEP [#ML1]_,
+four fundamental ideas for splitting distfiles were listed:
+
+a. using initial portion of filename,
+
+b. using initial portion of file hash,
+
+c. using initial portion of filename hash,
+
+d. using package category (and package name).
+
+The initial filename idea was to use the first character of the
+filename, possibly followed by a longer part, a scheme historically
+used e.g. by the PyPI Python package hosting. Its main advantage is
+simplicity. Users can easily determine the correct subdirectory
+by just looking at the distfile name. Sadly, this solution is not only
+very uneven but also does not solve the problem. As mentioned above,
+the TeΧ Live packages share a long common prefix that makes it
+impossible to split them properly using fixed-length prefixes.
+
+This idea has been followed by an adaptive proposal by Andrew Barchuk
+[#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
+mapped to groups by a common prefix but instead each group contains
+all files between two prefixes being used (like in a dictionary).
+However, it has been pointed out that while this option can provide
+very even results initially, it is impossible to predict how it would
+be affected by future distfile changes and there will be a risk of
+needing to change the groups in the future. Furthermore, it is
+relatively complex and requires explicitly listing or obtaining
+the groups in use.
+
+Another option was to use an initial portion of distfile hashes. Its
+main advantage is that cryptographic hash algorithms can provide
+a more balanced split with random data. Furthermore, since hashes are
+stored in Manifests, using them has no cost for users. However, this
+solution has three disadvantages:
+
+1. Not all files in the distfile tree are covered by package Manifests.
+ Additional files are injected into the mirrors, and those will
+ not have a clearly-defined location.
2. User-provided distfiles (e.g. for fetch-restricted packages) with
hash mismatches would be placed in the wrong subdirectory,
potentially causing confusing errors.
-3. Not all files in the distfiles tree are covered by package Manifests
- --- there are additional files that are injected into distfiles.
+3. The hash values are unknown for newly-downloaded distfiles, so
+ ``repoman`` (or an equivalent tool) would have to use a temporary
+ directory before placing the file in the appropriate subdirectory.
-Using filename hashes has proven to provide a similar balance
-to using file hashes. Furthermore, since filenames are known up front
-this solution does not suffer from the both listed problems. While
-hashes need to be computed manually, hashing short string should not
-cause any performance problems.
+Using filename hashes has proven to provide a similar balance to using
+file hashes. Furthermore, since filenames are known up front, this
+solution does not suffer from the problems listed above. While hashes
+need to be computed manually, hashing a short string should not cause
+any performance problems.
+
+Jason Zaman has suggested using package categories (and package names)
+[#PKGNAME]_. However, this solution has multiple problems:
+
+a. it does not solve the problem for large packages such as TeΧ Live,
+
+b. it introduces many unnecessarily small directories,
+
+c. it requires explicit knowledge of which package a distfile
+ belongs to,
+
+d. it does not provide an explicit solution to the problem of distfiles
+ shared by multiple packages,
+
+e. it does not provide a solution to the problem of injected distfiles.
+
+All the options considered, the filename hash solution was selected
+as the one that solves all the aforementioned problems while introducing
+relatively low complexity and being reasonably future-proof.
.. figure:: glep-0075-extras/by-filename.png
@@ -327,6 +382,17 @@ References
of DISTDIR
(https://bugs.gentoo.org/534528)
+.. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
+ (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
+
+.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
+ for each directory computed in a way to have the files distributed evenly'
+ (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
+
+.. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
+ as the packages themselves'
+ (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
+
Copyright
=========