Hi Sam,

I think this could be interesting. You could support learning parameters
on each sub-cluster by accepting a transformer as a parameter and then
applying sample = sklearn.base.clone(transformer).fit_transform(sample).
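
Rough sketch of what I have in mind (the function and names are purely
illustrative, not an existing API):

    from sklearn.base import clone
    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD

    def split_cluster(sample, transformer=None, n_clusters=2):
        # Re-fit a fresh clone of the transformer on this sub-cluster only,
        # so its parameters are learned per branch of the tree.
        if transformer is not None:
            sample = clone(transformer).fit_transform(sample)
        return KMeans(n_clusters=n_clusters).fit_predict(sample)

    # e.g. re-learn a low-rank projection at every split:
    # labels = split_cluster(X_sub, transformer=TruncatedSVD(n_components=10))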

I suspect bisecting k-means is notable enough and different enough for
inclusion. Seeing as you have an implementation, I would suggest
constructing a small example (preferably on real-world data) that
highlights the superiority or distinctiveness of this approach. Once you
have something illustrative, submit a PR with the output and see how people
review it.

In terms of establishing a fixed algorithm, is the criterion for choosing
which cluster to split next standard in the literature? Are there
alternatives? (I've not read up yet.)
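
To make the question concrete, here is a rough sketch of the loop as I
understand it from your description, using largest within-cluster SSE as
one possible selection criterion (both the criterion and the names are my
guesses, not taken from your implementation):

    import numpy as np
    from sklearn.cluster import KMeans

    def bisecting_kmeans(X, n_clusters, random_state=0):
        # Start with a single cluster containing every sample.
        clusters = [np.arange(X.shape[0])]
        while len(clusters) < n_clusters:
            # Criterion (assumed): split the cluster with the largest
            # within-cluster sum of squares next.
            sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum()
                   for idx in clusters]
            to_split = clusters.pop(int(np.argmax(sse)))
            labels = KMeans(n_clusters=2,
                            random_state=random_state).fit_predict(X[to_split])
            clusters.append(to_split[labels == 0])
            clusters.append(to_split[labels == 1])
        # Flatten the index sets back into a single label vector.
        out = np.empty(X.shape[0], dtype=int)
        for label, idx in enumerate(clusters):
            out[idx] = label
        return out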

Thanks,

Joel

On 17 May 2015 at 06:43, Sam Schetterer <samsc...@gmail.com> wrote:

> Andreas,
>
> There isn't necessarily a linkage function defined, at least in the sense
> of agglomerative clustering, since this is not comparing clusters to merge
> but rather breaking them up. The clusters are split using another
> clustering algorithm supplied by the caller. The most common one that I've
> found in the literature is kmeans with 2 clusters, which leads to a binary
> tree structure and is generally referred to as bisecting kmeans (used, for
> example, in the first citation). One could use any clustering algorithm, or
> even use two different ones under different conditions (spectral clustering
> when n < 1000 and kmeans otherwise, for example).
>
> In addition, with divisive clustering, one can refine the distance metric
> for various tree branches, which I don't think is possible with agglomerative
> clustering. I've done this with text clustering to get more accurate tf-idf
> deeper in the hierarchy and the second paper I cited in the original email
> performs the SVD at each new level.
>
> You bring up a good point about divisive vs agglomerative being an
> implementation detail, although for certain uses I think it may be very
> important. If it's expensive to compute a connectivity matrix, a bisecting
> kmeans will perform significantly better than the agglomerative methods on
> larger datasets.
>
> Best,
> Sam
>