Andreas,

That's pretty much what my idea for this would be. By default, the
algorithm uses bisecting kmeans, but you can specify any clusterer that
follows the scikit-learn API, or any function that conforms to a specific
signature. I think there are also some interesting possibilities in
allowing the split criterion to be based on a user-supplied predicate
instead of just n_clusters, especially in the context of mixing and
matching different clustering algorithms.
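To make that concrete, here is a rough sketch of what such a meta-algorithm could look like. The function name, the make_splitter parameter, and the split-the-largest-cluster default are all illustrative, not a proposed API:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive_cluster(X, n_clusters,
                     make_splitter=lambda: KMeans(n_clusters=2, n_init=10,
                                                  random_state=0)):
    """Repeatedly bisect until n_clusters leaves remain.

    Bisecting kmeans by default; any scikit-learn-style clusterer
    returned by make_splitter can be used instead.
    """
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < n_clusters:
        # naive default criterion: split the largest current cluster
        target = np.bincount(labels).argmax()
        mask = labels == target
        sub = make_splitter().fit_predict(X[mask])
        idx = np.flatnonzero(mask)
        # points assigned to the second half get a brand-new label
        labels[idx[sub == 1]] = labels.max() + 1
    return labels

X, _ = make_blobs(n_samples=90, centers=3, random_state=0)
labels = divisive_cluster(X, 3)
```

A real implementation would track the tree structure rather than just flat labels, but the recursion-by-splitting shape is the same.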

The difficult part will be designing an easily usable set of
functions/decorators that makes specifying custom algorithms simple while
keeping them separate from the implementation details of the divisive
algorithm. As with anything highly customizable, the hard part is
preventing that customizability from getting in the way of simple use
cases.


Joel,

I'm working on building a set of tests/examples for this algorithm, and
then I'll submit the PR. In terms of deciding which cluster to split next,
the commonly used methods are to split the largest cluster or, for
centroid-based methods, to split the cluster with the largest variance
about its center. In line with the meta-clustering approach, users will be
able to supply criteria of their own, but the algorithm should choose
sensible defaults whenever possible.
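As a rough illustration, the two common criteria could each be a small function of the data and current labels (function names are hypothetical, not part of any proposed API):

```python
import numpy as np

def largest_cluster(X, labels):
    """Default criterion: split the cluster with the most points."""
    return int(np.bincount(labels).argmax())

def highest_variance(X, labels):
    """Centroid-based criterion: split the cluster whose points are
    furthest, on average, from their centroid.  Assumes labels 0..k-1."""
    sse = [((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
           .sum(axis=1).mean()
           for k in np.unique(labels)]
    return int(np.argmax(sse))

# three tight points in cluster 0, two widely spread points in cluster 1
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [20.0, 20.0]])
labels = np.array([0, 0, 0, 1, 1])
```

Here largest_cluster would pick cluster 0 (more points) while highest_variance would pick cluster 1 (further from its centroid), which shows why the choice of criterion matters.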

Best,
Sam

On Mon, May 18, 2015 at 10:34 AM, Andreas Mueller <t3k...@gmail.com> wrote:

>  This feels a bit like it should be a meta-estimator using an arbitrary
> clustering algorithm to create a divisive one.
> That would easily allow the PCA thing.
>
>
>
> On 05/17/2015 07:44 PM, Joel Nothman wrote:
>
> Hi Sam,
>
>  I think this could be interesting. You could allow for learning
> parameters on each sub-cluster by accepting a transformer as a parameter,
> then using sample = sklearn.base.clone(transformer).fit_transform(sample).
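In isolation, that clone-per-sub-cluster pattern would look roughly like this; the PCA transformer and the fixed labels are just stand-ins for whatever a real split produces:

```python
import numpy as np
from sklearn.base import clone
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(20, 5)
labels = np.array([0] * 10 + [1] * 10)   # pretend split into two sub-clusters
transformer = PCA(n_components=2)        # any transformer would do

transformed = {}
for k in np.unique(labels):
    sample = X[labels == k]
    # clone() gives an unfitted copy, so each sub-cluster learns its own
    # parameters from its own points, never sharing fitted state
    transformed[k] = clone(transformer).fit_transform(sample)
```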
>
>  I suspect bisecting k-means is notable enough and different enough for
> inclusion. Seeing as you have an implementation, I would suggest
> constructing a small example (preferably on real-world data) that
> highlights the superiority or distinctiveness of this approach. Once you
> have something illustrative, submit a PR with the output and see how people
> review it.
>
>  In terms of establishing a fixed algorithm, is the criterion for which
> cluster is next expanded standard in the literature? Are there
> alternatives? (I've not read up yet.)
>
>  Thanks,
>
>  Joel
>
> On 17 May 2015 at 06:43, Sam Schetterer <samsc...@gmail.com> wrote:
>
>> Andreas,
>>
>>  There isn't necessarily a linkage function defined, at least in the
>> sense of agglomerative clustering, since we are not comparing clusters to
>> merge but rather breaking them up. The clusters are split using another
>> clustering algorithm supplied by the caller. The most common one I've
>> found in the literature is kmeans with two clusters, which leads to a
>> binary tree structure and is generally referred to as bisecting kmeans
>> (used, for example, in the first citation). One could use any clustering
>> algorithm, or even two different ones under different conditions
>> (spectral clustering when n < 1000 and kmeans otherwise, for example).
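That mix-and-match idea could be expressed as a simple factory function; the 1000-sample threshold comes straight from the example above, and the name make_splitter is purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def make_splitter(X):
    """Choose a 2-way clusterer based on the size of the sub-cluster."""
    if X.shape[0] < 1000:
        # spectral clustering is affordable on small sub-clusters
        return SpectralClustering(n_clusters=2, random_state=0)
    # fall back to plain kmeans on larger ones
    return KMeans(n_clusters=2, n_init=10, random_state=0)

small = make_splitter(np.zeros((200, 2)))
large = make_splitter(np.zeros((5000, 2)))
```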
>>
>>  In addition, with divisive clustering one can refine the distance
>> metric for different tree branches, which I don't think is possible with
>> agglomerative clustering. I've done this with text clustering to get
>> more accurate tf-idf weights deeper in the hierarchy, and the second
>> paper I cited in the original email performs the SVD at each new level.
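A minimal sketch of that per-branch refinement for text, using a toy corpus and pretend first-split labels: refitting TfidfVectorizer on each branch recomputes the idf weights from that branch's documents only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana", "banana fruit", "car engine", "engine wheel"]
labels = [0, 0, 1, 1]  # pretend these came from the first split

branch_features = {}
for k in set(labels):
    branch_docs = [d for d, lab in zip(docs, labels) if lab == k]
    # a fresh vectorizer per branch: idf weights reflect this branch's
    # vocabulary only, sharpening term weights deeper in the tree
    branch_features[k] = TfidfVectorizer().fit_transform(branch_docs)
```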
>>
>>  You bring up a good point about divisive vs. agglomerative clustering
>> being an implementation detail, although I think for certain uses it may
>> be very important. If computing a connectivity matrix is expensive,
>> bisecting kmeans will perform significantly better than the agglomerative
>> methods on larger datasets.
>>
>>  Best,
>> Sam
>>
>
>
> ------------------------------------------------------------------------------
> One dashboard for servers and applications across Physical-Virtual-Cloud
> Widest out-of-the-box monitoring support with 50+ applications
> Performance metrics, stats and reports that give you Actionable Insights
> Deep dive visibility with transaction tracing using APM Insight.
> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general