Hello,

I'm doing some work involving finding the shortest Dijkstra distances
between nodes in a graph. I'm interested in contributing to sklearn to
overcome an issue I ran into.

My problem is that I have over N=100,000 nodes defining the graph, but I'm
only interested in getting distances between all pairs within a subset of
them (e.g., between the M=10,000 nodes of interest). Not only would
calculating /all/ pairs be incredibly inefficient, but it causes a memory
error when using `graph_shortest_path.py`. Unfortunately
`single_source_shortest_path` doesn't do Dijkstra distances (only
unweighted), so I can't use that -- it's also not anywhere near as
efficient. I can get away with using networkx functions, but they are also
much slower than the sklearn Cython code.

It seems the following fairly simple changes to `graph_shortest_path.pyx`
would enable my desired behavior of subselecting a set of nodes to
calculate pairwise distances for:

1. Add argument `nodes=None` to allow specification of nodes of interest. A
user passing `None` would basically be translated into being
`np.arange(N)`, where N is the number of nodes in the graph, to maintain
current behavior. When argument != None, `dijkstra` must be the selected
method, unless there's an easy way to modify `floyd_warshall` to operate on
selected rows (I don't immediately see how it's possible).
2. Modify `dijkstra` to take and pass the additional `nodes` argument (now
actually an array).
3. Modify `dijkstra` to loop over only selected indices (line 213 and 229:
`for i from 0 <= i < N:`).
4. Modify the output type of `dijkstra` and `graph_shortest_path` to be an
MxM array, where M is the number of nodes for which distances are
calculated.

Instead of changing `graph_shortest_path`, a second option is to overhaul
`single_source_shortest_path_length` to make use of the `dijkstra` code, as
suggested on the developers utilities page. The issue with changing the
behavior of that function, though, is that it doesn't currently do weighted
path lengths, so it might cause an effective API change. It also wouldn't
be as efficient to loop over my 10,000 vertices in python, select the
appropriate 10,000-element subset from the 100,000 that were computed, and
make an array of those (compared to doing everything in C).

A third option is to make another function entirely, perhaps
`multiple_source_shortest_path`. It would borrow heavily from
`graph_shortest_path.pyx`. This option doesn't sound that great to me due
to code duplication.

Thoughts?

Cheers,
Eric
------------------------------------------------------------------------------
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from 
the latest Intel processors and coprocessors. See abstracts and register >
http://pubads.g.doubleclick.net/gampad/clk?id=60134071&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to