Hi all,
A colleague asked me today how the scikit-learn DBSCAN
algorithm should be applied, and I must admit that the documentation
and example were confusing even to me. The fit docstring says:

    X: array [n_samples, n_samples] or [n_samples, n_features]
        Array of distances between samples, or a feature array.
        The array is treated as a feature array unless the metric is
        given as 'precomputed'.

However, the online demo [1] does the following:

    D = distance.squareform(distance.pdist(X))
    S = 1 - (D / np.max(D))
    db = DBSCAN().fit(S, eps=0.95, min_samples=10)

which uses a similarity matrix rather than a feature matrix as input
without passing metric="precomputed". Am I missing some interesting
clustering trick here, or is this a bug? I tried running the example
with the original feature matrix X (without tuning the parameters) and
it gave different output: all points were considered a single cluster
with no outliers.
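
For comparison, here is a minimal sketch of the two usages I would
expect from the docstring: a feature array with the default metric, and
a distance matrix with metric="precomputed". The toy data, the eps and
min_samples values, and the variable names are my own assumptions, not
taken from the demo. It also assumes a scikit-learn version in which eps
and min_samples are constructor arguments rather than fit arguments as
in the demo above.

```python
import numpy as np
from scipy.spatial import distance
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight, well-separated 2-D blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])

# Usage 1: X is a feature array [n_samples, n_features];
# DBSCAN computes Euclidean distances itself.
labels_features = DBSCAN(eps=0.5, min_samples=10).fit(X).labels_

# Usage 2: D is a distance matrix [n_samples, n_samples];
# metric="precomputed" tells DBSCAN not to treat it as features.
D = distance.squareform(distance.pdist(X))
labels_precomputed = DBSCAN(
    eps=0.5, min_samples=10, metric="precomputed"
).fit(D).labels_

# On this toy data, both usages should give the same labelling.
print(np.array_equal(labels_features, labels_precomputed))
```

Note that both usages take distances, so eps is a distance threshold in
each case. The demo passes a similarity matrix instead, which matches
neither of them.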
TIA,
Lars
[1] http://scikit-learn.org/0.10/auto_examples/cluster/plot_dbscan.html
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam