[pkg-discuss] Notes on SAT solver as applied to finding packaging solution(s)

Bart Smaalders Wed, 11 Feb 2009 18:35:05 -0800


It's become clear that our approach to evaluating
packaging dependencies and constraints needs to become
a lot more sophisticated; we've been trying to make
packaging metadata more accurately reflect the way
we build and test packages in experiments and
the current solver is not very powerful.


A significant part of the difficulty is dealing with
externally produced packages; if a variety of versions
are available we may need to iteratively test multiple
versions, evaluating their dependencies to find one that
is compatible w/ the constraints that may be active on
the current image.  This may involve extensive back-tracking
and rejection of previously made decisions.  Such changes
would significantly complicate the current code, and produce
even longer run times during package plan creation...

One method of doing this sort of automated decision making
is to cast the problem as a series of boolean expressions,
and then apply a SAT solver to find a solution.  These notes
describe the results of my experiments w/ the minisat solver
Stephen posted last month.  The folks working on libzypp
for OpenSuse are also using a SAT solver to compute their
package solutions, although their mechanisms are quite
different.

Notes:
--------

1) The presence of a particular package version is a
   single boolean variable; True if it's present,
   False if not.

   The problem set handed to the SAT solver is a series
   of clauses; each clause are boolean variables (or
   their negation) or'd together.  All clauses must be
   true for the solution to exist.

   The clauses need to encode the fact that only one version
   of a package may be installed at a time, and also encode
   all differnt package dependencies and constraints.

2) Each package has some number of versions, inherently ordered.
   Only one version of a package may be installed at a time

   pkg a -> a.1, a.2, a.3, a.4
   pkg b -> b.1, b.2, b.3, b.4

   Thus for "a":

   !a.1 | !a.2
   !a.1 | !a.3
   !a.1 | !a.4
   !a.2 | !a.3
   !a.2 | !a.4
   !a.3 | !a.4

   where !a represents the negation of a.

   This means that for N versions, we have N*(N-1)/2 clauses;
   pruning older non-accessible versions will be required to
   bound memory consumption.  In the case of package installation
   rather than upgrade, we may use the installed incorporations
   to radically prune the fmri set prior to automated solution.

3) Each version of a package may have dependencies on other
   packages, either w/ or w/o a version.  The version specification
   will likely not be fully specified (eg multiple versions
   may satisfy this requirement).

4) dependencies may be of the following types:

   required: fmri specifies minimum acceptable version

   if a.1 requires b.2, b.3 or b.4:
      !a.1 | b.2 | b.3 | b.4

   optional: if present, fmri must be at this level or greater
   if a.1 optionally requires b.3:
      !a.1 | !b.1
      !a.1 | !b.2

   incorporate: if present, pkg must match fmri

   if a.1 incorporates b.3:
      !a.1 | !b.1
      !a.1 | !b.2
      !a.1 | !b.4

   All of these are linear in the number of package versions
   either meeting or failing to meet the dependency.

5) To express state, the presence of a package is encoded as a
   clause.  We compute the matching fmris and then construct
   a clause that matches one of those fmris.  Specifying a single
   version requires that version to be present in the solution;
   we can also specify current version or newer, or any version of
   a package.  In this way we can use the same solver for "pkg install",
   pkg "image-update" and other flavors such as ways of upgrading
   to a specified version.

6) The SAT solver will find a particular solution to our packaging
   problem, but there is no way of "preferring" newer packages, and
   preventing the introduction of extraneous unneeded packages since
   all clauses and variables are strictly Boolean.

   As a result, external optimization in the form of repeated
   solution attempts w/ additional constraints is necessary.  A
   simplistic approach is to compute an optimality criterion and
   evaluate all solutions and pick the most optimal; this fails in
   the face of massive numbers of possible solutions.  A somewhat
   more clever approach is needed.

   The following algorithm has been implemented:

   The packaging problem to be solved is expressed as a series of
   boolean constraints, and a solution obtained.  Then, for each
   fmri appearing in the solution vector, all older versions are
   excluded; in other words, if a.3 is part of the solution, then
   subsequent solutions will not contain a.1 or a.2.  Then a single
   vector is added that is the negation of the just found vector,
   and another solution is found.  For example:

   if solution is a.2, b.3, z.10, we add

   # first negations of older versions
   !a.1
   !b.1
   !b.2
   !z.1
   !z.2
   ...
   !z.9
   # now negate just found solution
   !a.2 | !b.3 | !z.10

   The latter vector requires that the new solution not contain
   a.2 and b.3 and z.10; since we've excluded older versions we
   will either get a vector that eliminates one of the packages
   as unneeded (if dependencies allow) or one that has newer
   versions of one of the needed packags.

   We repeat the above process until a solution cannot be found;
   the last found solution must therefore be the most optimal one.

   The above technique may fail to find the overall optimal
   solution if newer packages have incorporation dependencies
   on earlier versions of their dependencies.  This is expected
   to be rare.  Pruning the solution space to eliminate older
   packages is necessary due to rapid solution space growth if
   there are multiple versions that satisfy dependencies; we
   may adopting a more partial pruning scheme if such "holes"
   become an issue in the future.

7) Next on the list are finding ways of reducing the time needed
   to construct the Boolean clauses.  Solution time is very
   short - typically a fraction of a second - but loading the
   problem w/ the current naive implementation is taking
   20 to 40 seconds to load the 25K+ pkg versions currently
   in the dev repo.  Pruning the solution space is part of the
   answer, as is improving the efficiency of finding matching
   packages.  I hope to have some sample code out soon to enable
   others to experiment with this.

8) Constructing user-friendly error messages may be difficult;
   mapping the inability to find a solution to a problem w/
   25K variables and 1M clauses into a concise error message
   is a bit daunting at first.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[email protected]         http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss

[pkg-discuss] Notes on SAT solver as applied to finding packaging solution(s)

Reply via email to