commit:     e4dc2627c8107339b13e20709125e2d9fc91ffde
Author:     Michał Górny <mgorny <AT> gentoo <DOT> org>
AuthorDate: Wed Feb  7 13:20:45 2018 +0000
Commit:     Michał Górny <mgorny <AT> gentoo <DOT> org>
CommitDate: Wed Feb  7 13:22:08 2018 +0000
URL:        https://gitweb.gentoo.org/data/glep.git/commit/?id=e4dc2627

glep-0075: Extend rationale for splitting algorithm

Extend and refactor the rationale for splitting algorithm. Explicitly
state the goals, list all the options that occurred during the ml
discussion.

 glep-0075.rst | 116 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 91 insertions(+), 25 deletions(-)

diff --git a/glep-0075.rst b/glep-0075.rst
index 157514e..00d14c3 100644
--- a/glep-0075.rst
+++ b/glep-0075.rst
@@ -187,43 +187,98 @@ Rationale
 =========
 Algorithm for splitting distfiles
 ---------------------------------
-In the original debate that occurred in bug #534528 [#BUG534528]_,
-three possible solutions for splitting distfiles were listed:
+The possible algorithms were considered with the following goals
+in mind:
 
-a. using initial portion of filename,
+- the number of files in a single directory should not exceed 1000,
 
-b. using initial portion of file hash,
+- the total size of files in a single directory is not considered
+  relevant,
 
-c. using initial portion of filename hash.
+- the solution should preferably be future-proof,
 
-The significant advantage of the filename option was simplicity.  With
-that solution, the users could easily determine the correct subdirectory
-themselves.  However, it's significant disadvantage was very uneven
-shuffling of data.  In particular, the TeΧ Live packages alone count
-almost 23500 distfiles and all use a common prefix, making it impossible
-to split them further.
+- moving distfiles should be avoided once it is deployed.
 
-The alternate option of using file hash has the advantage of having
-a more balanced split.  Furthermore, since hashes are stored
-in Manifests using them is zero-cost.  However, this solution has three
-significant disadvantages:
+It should also be noted that at the time of writing the package with
+the most distfiles in Gentoo is dev-texlive/texlive-latexextra,
+with 8556 distfiles.  All of them start with a common
+prefix of ``texlive-module-``.  This specific prefix is used by a total
+of 23435 distfiles.
 
-1. The hash values are unknown for newly-downloaded distfiles, so
-   ``repoman`` (or an equivalent tool) would have to use a temporary
-   directory before locating the file in appropriate subdirectory.
+In the original debate that occurred in bug #534528 [#BUG534528]_
+and the mailing list review of the initial version of this GLEP [#ML1]_,
+four fundamental ideas for splitting distfiles were listed:
+
+a. using initial portion of filename,
+
+b. using initial portion of file hash,
+
+c. using initial portion of filename hash,
+
+d. using package category (and package name).
+
+The initial filename idea was to use the first character of filename,
+possibly followed by a longer part which was the idea historically
+used e.g. by PyPI Python package hosting.  Its main advantage is
+simplicity.  The users can easily determine the correct subdirectory
+by just looking at the distfile name.  Sadly, this solution is not only
+very uneven but does not solve the problem.  As mentioned above,
+the TeΧ Live packages share a long common prefix that makes it impossible
+to split them properly with other packages on fixed-length prefixes.
+
+This idea has been followed by an adaptive proposal by Andrew Barchuk
+[#ADAPTIVE_FILENAME]_.  In this proposal, the filenames are not strictly
+mapped to groups by a common prefix but instead each group contains
+all files between two prefixes being used (like in a dictionary).
+However, it has been pointed out that while this option can provide
+very even results initially, it is impossible to predict how it would
+be affected by future distfile changes and there will be a risk of
+needing to change the groups in the future.  Furthermore, it is
+relatively complex and requires explicitly listing or obtaining used
+groups.
+
+Another option was to use an initial portion of distfile hashes.  Its
+main advantage is that cryptographic hash algorithms can provide
+a more balanced split with random data.  Furthermore, since hashes are
+stored in Manifests using them has no cost for users.  However, this
+solution has three disadvantages:
+
+1. Not all files in the distfile tree are covered by package Manifests.
+   Additional files are injected into the mirrors, and those will
+   not have a clearly-defined location.
 
 2. User-provided distfiles (e.g. for fetch-restricted packages) with
    hash mismatches would be placed in the wrong subdirectory,
    potentially causing confusing errors.
 
-3. Not all files in the distfiles tree are covered by package Manifests
-   --- there are additional files that are injected into distfiles.
+3. The hash values are unknown for newly-downloaded distfiles, so
+   ``repoman`` (or an equivalent tool) would have to use a temporary
+   directory before locating the file in the appropriate subdirectory.
 
-Using filename hashes has proven to provide a similar balance
-to using file hashes.  Furthermore, since filenames are known up front
-this solution does not suffer from the both listed problems.  While
-hashes need to be computed manually, hashing short string should not
-cause any performance problems.
+Using filename hashes has proven to provide a similar balance to using
+file hashes.  Furthermore, since filenames are known up front, this
+solution does not suffer from the listed problems.  While hashes need
+to be computed manually, hashing short strings should not cause
+any performance problems.
+
+Jason Zaman has suggested using package categories (and package names)
+[#PKGNAME]_.  However, this solution has multiple problems:
+
+a. it does not solve the problem for large packages such as TeΧ Live,
+
+b. it introduces many unnecessarily small directories,
+
+c. it requires an explicit knowledge of which package distfiles
+   belong to,
+
+d. it does not provide an explicit solution to the problem of distfiles
+   shared by multiple packages,
+
+e. it does not provide a solution to the problem of injected distfiles.
+
+All the options considered, the filename hash solution was selected
+as one that solves all the aforementioned problems while introducing
+relatively low complexity and being reasonably future-proof.
 
 .. figure:: glep-0075-extras/by-filename.png
 
@@ -327,6 +382,17 @@ References
    of DISTDIR
    (https://bugs.gentoo.org/534528)
 
+.. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
+   (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
+
+.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
+   for each directory computed in a way to have the files distributed evenly'
+   (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
+
+.. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
+   as the packages themselves'
+   (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
+
 
 Copyright
 =========
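
As an illustration of the rationale in the patch above, the following is a minimal sketch that compares option (a), the initial portion of the filename, with option (c), the initial portion of the filename hash. It is not taken from the GLEP text shown here: the choice of BLAKE2b, the two-character cutoff, the helper names ``prefix_dir`` and ``filename_hash_dir``, and the generated distfile names are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def prefix_dir(filename: str, length: int = 1) -> str:
    """Option (a): directory named after the first `length` characters."""
    return filename[:length]


def filename_hash_dir(filename: str, hex_chars: int = 2) -> str:
    """Option (c): directory named after the start of the filename's hash.

    BLAKE2b and the two-character cutoff are assumptions of this sketch.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_chars]


# Hypothetical distfile names sharing the ``texlive-module-`` prefix.
names = [f"texlive-module-example{i}-2017.tar.xz" for i in range(1000)]

by_prefix = Counter(prefix_dir(n, length=4) for n in names)
by_hash = Counter(filename_hash_dir(n) for n in names)

print(len(by_prefix), "directory with prefix splitting")
print(len(by_hash), "directories with filename-hash splitting")
```

Because every generated name starts with the same characters, the prefix scheme puts all of them into a single directory, while the filename hash spreads them over many. The hash needs only the filename as input, so the target directory is known before the file is downloaded.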
