Hi,
The cs_graph_components routine in sklearn utils has a few problems: in
particular, though the doc string says it only accesses the upper
triangular part of the matrix, this is not true. Internally it assumes the
matrix is symmetric, and will not return the correct results if the
lower-triangular piece is zero. Also, I believe that due to the symmetry
requirement it effectively only finds strongly-connected components, so a
MST should be considered entirely unconnected!
The components-finding algorithm has been greatly improved in the
scipy.sparse.csgraph, especially given a recent PR which sped up the
algorithm and improved memory performance immensely. It may be worth
back-porting that routine as well, and replacing the old
cs_graph_components utility routine.
Jake
On Wed, Jun 5, 2013 at 4:54 PM, Robert Layton <robertlay...@gmail.com>wrote:
> I'm using this workflow for a minimum spanning tree clusterer for my
> current PR on evidence accumulation clustering (see
> https://github.com/scikit-learn/scikit-learn/pull/1830 )
>
> My understanding is that the output of minimum_spanning_tree (recently
> backported to sklearn) should plug directly into cs_graph_components.
>
> The code for the fit function is this:
> #imports
> from ..utils.mst import minimum_spanning_tree
> from ..utils._csgraph import cs_graph_components
>
> <snipped>
>
> def fit(X):
> self.span_tree = minimum_spanning_tree(X)
> # Remote any link with a distance more than self.threshold
> self.span_tree.data[self.span_tree.data < self.threshold] = 0
> self.span_tree = self.span_tree.trunc()
> # Compute clusters by finding connected subgraphs in self.span_tree
> n_components, self.labels_ = cs_graph_components(self.span_tree)
> return self
>
>
> This doesn't give the intended answer when run with this test (matrix is
> from
> http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.minimum_spanning_tree.html
> ):
>
> def simple_test():
> # Uses the simple example from the website
> X = csr_matrix([[0, 8, 0, 3],
> [0, 0, 2, 5],
> [0, 0, 0, 6],
> [0, 0, 0, 0]])
> model = MSTCluster(threshold=3) # Clustering algorithm defined above
> labels = model.fit(X).labels_
> assert_array_equal(labels, np.array([0,0,1,0]),
> "Given={}\nExpected={}".format(labels, np.array([0,0,1,0])))
> return labels
>
> Instead, the labels are [ 0 1 -2 -2]
> I would expect this to be [0, -2, 0, 0], given that unconnected nodes are
> given -2 as a label (I can fix this). Ideally it would be [0, 0, 1, 0]
>
> There seems to be a bit of discussion about this here:
> http://mail.scipy.org/pipermail/scipy-dev/2012-March/017261.html
>
> My thinking is that this is the problem that I'm encountering. If so,
> should this backported code be updated to a more recent version?
>
>
> Thanks,
>
> - Robert
>
>
>
>
>
>
> --
>
> Public key at: http://pgp.mit.edu/ Search for this email address and
> select the key from "2011-08-19" (key id: 54BA8735)
>
>
> ------------------------------------------------------------------------------
> How ServiceNow helps IT people transform IT departments:
> 1. A cloud service to automate IT design, transition and operations
> 2. Dashboards that offer high-level views of enterprise services
> 3. A single system of record for all IT processes
> http://p.sf.net/sfu/servicenow-d2d-j
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
------------------------------------------------------------------------------
How ServiceNow helps IT people transform IT departments:
1. A cloud service to automate IT design, transition and operations
2. Dashboards that offer high-level views of enterprise services
3. A single system of record for all IT processes
http://p.sf.net/sfu/servicenow-d2d-j
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general