Tim Peters <[email protected]> added the comment:
Let's stir this up a bit ;-) I recently had occasion to exchange ideas with
Larry Hastings about topsorts for use in a package manager he's writing. I
thought his API for adding edges was ... perfect:
add(node, *dependson)
So, e.g., add(A, B, C) says A depends on B, and on C, but says nothing else
about B and C. This is almost always the way topsorts show up in real life:
you know what a thing depends *on* directly, but have scant idea which things
depend on *it*. For example, you know that baking a cake
requires (among other things) flour, but have no real idea of the universe of
other things that require flour. Likewise Larry knows which packages each
package requires, but not the reverse. Etc.
Nodes with no edges are trivial to add then: add(A).
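For concreteness, the cake example might look like this with a hypothetical
TopSorter instance (a rough sketch of such a class appears further down in this
message; the name is just illustrative):

    ts = TopSorter()
    ts.add("cake", "flour", "eggs")   # cake depends on flour and eggs
    ts.add("flour", "wheat")          # flour depends on wheat
    ts.add("eggs")                    # eggs have no predecessors
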
If you're building input to a topsort from a graph, also trivial:
    for n, p in node2predecessors.items():
        topsort_instance.add(n, *p)
and it doesn't matter whether the predecessors in the original graph were
stored in a list, set, tuple, or any other iterable container. Nothing special
about an empty collection of predecessors either.
The other big thing that came up is that most topsort programs were useless for
his goal: downloading and installing packages takes significant wall clock
time, and there's huge opportunity for exploiting parallelism. But a flat
sequence in topsort order gives no clue about what _can_ be done in parallel.
Instead you really want several methods, like
prepare()
to say that you're done building the graph; and,
get_ready()
to get all nodes ready to go, which haven't already been returned by
get_ready() calls (initially, this is the collection of nodes with no
predecessors, which prepare() can know); and,
done(node)
to say that `node` (returned by a previous call to get_ready()) is finished
now, so that the next call to get_ready() can return all (if any) nodes for
which `node` was the last non-done predecessor; and,
is_active()
to say whether the topsort can make more progress (is_active() returns True iff
there are still nodes ready to go that haven't yet been passed out by
get_ready(), or if the number of nodes marked done() is less than the number
that have been passed out by get_ready()).
These are all easy to code, and allow the user to extract all the opportunities
for parallelism that theoretically exist. There is no static order that can do
so, since the opportunities that exist at any time depend on the times and
order in which nodes are marked done() in real life - and that may vary from
one run to the next.
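To make that concrete, here's a rough sketch of one way add() and the four
methods could hang together. Everything here is illustrative only - the class
name and the internals are my assumptions, and cycle detection and error
checking are omitted for brevity:

    from collections import defaultdict

    class TopSorter:
        # Illustrative sketch only; no cycle detection, no error checking.

        def __init__(self):
            self._preds = {}                 # node -> set of direct predecessors
            self._succs = defaultdict(set)   # node -> set of direct successors

        def add(self, node, *dependson):
            self._preds.setdefault(node, set()).update(dependson)
            for pred in dependson:
                self._preds.setdefault(pred, set())
                self._succs[pred].add(node)

        def prepare(self):
            # Count of not-yet-done predecessors per node; nodes with a count
            # of zero are ready immediately.
            self._npreds = {n: len(p) for n, p in self._preds.items()}
            self._ready = [n for n, count in self._npreds.items() if count == 0]
            self._npassedout = 0             # handed out by get_ready()
            self._ndone = 0                  # marked finished via done()

        def get_ready(self):
            result = tuple(self._ready)
            self._npassedout += len(result)
            self._ready = []
            return result

        def done(self, node):
            self._ndone += 1
            for succ in self._succs[node]:
                self._npreds[succ] -= 1
                if self._npreds[succ] == 0:
                    self._ready.append(succ)

        def is_active(self):
            # More progress is possible if there are nodes ready to hand out,
            # or nodes handed out that haven't been marked done() yet.
            return bool(self._ready) or self._ndone < self._npassedout
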
Of course a deterministic static order can be derived from those, like, e.g.,
    def static_order(self):
        self.prepare()
        while self.is_active():
            for node in self.get_ready():
                yield node
                self.done(node)
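For instance, assuming static_order() above is added as a method of the
hypothetical TopSorter sketched earlier, the cake example might run like so:

    ts = TopSorter()
    ts.add("cake", "flour", "eggs")
    ts.add("flour", "wheat")
    print(list(ts.static_order()))
    # one possible output: ['eggs', 'wheat', 'flour', 'cake']
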
For parallel use, e.g.,
    instance.prepare()
    while instance.is_active():
        for node in instance.get_ready():
            inq.put(node)
        node = outq.get()
        instance.done(node)
where worker threads or processes take nodes to work on off of queue `inq`,
then, when the work for a node is done, put the node on queue `outq`.
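Fleshed out a bit, a thread-based driver might look like this rough sketch.
process() is a stand-in for whatever the real per-node work is, and the worker
count is arbitrary:

    import queue, threading

    def worker(inq, outq):
        # Take ready nodes off inq, do the real work, report completion on outq.
        while True:
            node = inq.get()
            if node is None:        # sentinel: time to shut down
                break
            process(node)           # stand-in for the real work on `node`
            outq.put(node)

    inq, outq = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(inq, outq)) for _ in range(4)]
    for w in workers:
        w.start()

    instance.prepare()
    while instance.is_active():
        for node in instance.get_ready():
            inq.put(node)
        node = outq.get()
        instance.done(node)

    for w in workers:
        inq.put(None)               # one sentinel per worker
    for w in workers:
        w.join()
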
Am I seriously suggesting this for Python? Sure. It's fun to advance the
practical state of the art :-)
----------
nosy: +tim.peters
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue17005>
_______________________________________