Due to version and platform constraints, a SAT solver is necessary for conda and (now) pip. Libsolv is one of the fastest around.
https://github.com/QuantStack/mamba is conda implemented with libsolv (and parallel downloading of *declarative* dependency metadata). For general purpose graphs with implicit node instantation from edge declaration, NetworkX has implementations of many graph algorithms. https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.topological_sort.html CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from Python) with a "NetworkX-like API" but doesn't yet have a topological sort (though it does have BFS) https://github.com/rapidsai/cugraph "pip needs a dependency resolver" + End (Fn+Right) links to the latest work on the new pip code (that must require declarative dependency metadata) https://github.com/pypa/pip/issues/988 That said, this implementation of topo sort could have a deterministic output given an OrderedSet: https://rosettacode.org/wiki/Topological_sort#Python A table including Big-O and memory usage for the necessary data structure and methods would be useful. On Sun, Dec 29, 2019, 5:12 PM Tim Peters <tim.pet...@gmail.com> wrote: > [Larry] > > It's a lightweight abstract dependency graph. Its nodes are opaque, > > only required to be hashable. And it doesn't require that you give it > > all the nodes in strict dependency order. > > > > When you add a node, you can also optionally specify > > dependencies, and those dependencies aren't required > > to be nodes that the graph has previously seen. So you can add > > a node A that depends on B and C, without showing it B or C > > first. When that happens it creates placeholder nodes for B > > and C, and naturally these nodes have no dependencies. You > > can then later inform the graph that node B has dependencies > > on X Y and Z. > > > > (My specific use case is a package manager. Packages are identified > > with unique strings. So the nodes I give the give the graph are simply > > those package names. It's pretty common for me to get a package > > with dependencies on packages I haven't seen yet. Designing the > > graph to create placeholders let me skip making two passes over > > the package list, pre-creating the package objects, etc etc. This > > made the code simpler and presumably faster.) > > > > The graph API maintains an externally-visible "ready" iterable of > > nodes. And since B can go from ready to not-ready, it can get added > > to "ready" and subsequently removed. > > > > Also, when a node is satisfied, you simply remove it from the graph. > > If A depends on B and C, and they all represent jobs, and B and C > > have no dependencies, they'll be in the "ready" iterable. You iterate > > over "ready", and execute those jobs, and assuming they are > > successful you then remove them from the graph. When A's > > dependencies are all satisfied, it'll get added to the "ready" iterable. > > And naturally when B and C were removed from the graph, they were > > removed from the "ready" iterable too. > > OK! You're doing a topological sort. There are natural & simple ways > to do that right now that scale efficiently to very large graphs (time > & space linear in the sum of the number of nodes and the number of > dependencies). Curiously, the difficulties are mostly driven by the > quality of _error_ messages you want (in case of, e.g., cycles in the > dependency graph). > > Lots of those implementations became deterministic "by magic" when > ordered dicts were introduced. This is why: a bare bones > implementation needs to associate two pieces of info with each node: > a list of its immediate successors, and an integer count of the number > of its immediate predecessors. A dict is the natural way to implement > that association. > > When all the dependency info has been entered, then: > > - First all nodes are emitted whose predecessor count is 0. Iterating > over the association dict once is the natural way to find them, and > that order is defined now. > > - Then, as each emitted node N is marked done: > for child in N.successors: > assert child.predcount > 0 > child.predcount -= 1 > if child.predcount == 0: > emit(child) > > That's about all there is to it. But there's no end to the cruft that > _can_ be added to, e.g., verify that a node is marked done no more > than once, or to compute & display strongly connected components if > not all nodes are emitted, etc. > > Ordered sets could improve this "natural" implementation if the > successor lists were successor sets instead, simply because then - > like lists - their iteration order would be defined, which can in turn > affect the order in which nodes are emitted in the loop just above. > But lists are adequate to the task, are cheaper in both time and > space, and can't _not_ be ordered ;-) > > > > Thus it's this "ready" collection that I want to a) be iterable, b) be > stable, and > > c) support fast removal. I add and remove nodes from that set a lot, > which I > > realized would perturb the iteration order from run to run, etc etc etc, > and here > > we are. > > The sketch above (which is a bog-common way to implement topsorts) > doesn't build a data structure recording the final order, and, in > fact, never deletes anything. You might protest that the initial > iteration step (scanning the association dict to find nodes with no > predecessors) is an expense you skip because nodes with predecessors > are deleted from your "ready" structure all along. But nodes with > predecessors are _touched_ either way, and merely skipping over them > is bound to be cheaper than deleting them. > > After that initial step, no search of any kind is needed again. > > > I grant you maybe it's a weird approach. But after two false starts, > where I > > was starting to boil the oceans with ABCs and "node is satisfied" APis > and > > separate iterator objects, I had a re-think and hit on this super > lightweight > > design. I'm actually quite pleased with it--it's short, it has a > minimal API, > > it's readable, it was easy to get right, and it solves my problem. > > Whereas I would have started with a topsort and finished it while I > was sleeping ;-) Seriously, I've written such a thing many times, but > never reuse the code because it's so easy to redo from scratch. Every > application has _some_ unique twist, which makes a general-purpose API > so bloated it's harder to figure out how to use than to write custom > code. > > In your application, I'm guessing (but don't know) that when a package > name is emitted, it's possible you're not able to find the package, > and so the process has to stop. That's why I inserted a "as each > emitted node N is marked done" step to my description. That is, I'm > picturing an API that adds a (say) > > topsorter.record_succeeded(node) > > method to say that - yes - the node was processed successfully. Else, > by default, its successors (if any) must not be emitted. > > But if that isn't needed, the whole topsort order can be materialized > into (say) a list in one gulp, or delivered by a generator. > > > > Happy new year, > > And to you too! :-) > _______________________________________________ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/HRKQPTPQQNIU3DYMYUYRC7USVRJGMRFC/ > Code of Conduct: http://python.org/psf/codeofconduct/ >
_______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/6UKXR63M6NYI3EQRXL7AOH3AWMTSNXUW/ Code of Conduct: http://python.org/psf/codeofconduct/