Hi Sam,

I agree that a meta estimator would be the way to go. However, the question is: 
are these techniques used widely enough to warrant inclusion in scikit-learn? I 
have the impression that they are used much less than agglomerative approaches. 

I have myself used them in the past with success (it's actually fairly easy to 
code), but that was for a very specific problem. 

Do you have evidence that such a meta estimator would be widely used? 

Cheers, 

Gaël 

Sent from my phone. Please forgive brevity and misspellings.



On May 18, 2015, at 18:29, Sam Schetterer <samsc...@gmail.com> wrote:
>Andreas,
>
>That's pretty much what my idea for this would be. By default, the
>algorithm uses bisecting kmeans, but you can specify any clusterer that
>follows the scikit-learn API or any function that follows a specific
>API. I think there are some interesting possibilities in allowing the
>cluster criteria to be based on a user-supplied predicate instead of
>just n_clusters as well, especially in the context of mixing and
>matching different clustering algorithms.
>
>The difficult part will be creating an easily usable set of
>functions/decorators that make specifying custom algorithms simple and
>keep them separate from implementation details of the divisive
>algorithm. As with anything that's highly customizable, the hard part
>is preventing that customizability from interfering with simple use
>cases of the algorithm.
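>
>To make this concrete, here is a rough, untested sketch of what such a
>meta-estimator could look like. The class name, the defaults, and the
>"split the largest cluster" rule are only my own illustration, not a
>settled design:
>
>    import numpy as np
>    from sklearn.base import BaseEstimator, ClusterMixin, clone
>    from sklearn.cluster import KMeans
>
>    class DivisiveClustering(BaseEstimator, ClusterMixin):
>        """Repeatedly split one cluster with a caller-supplied
>        2-way clusterer (bisecting kmeans by default)."""
>
>        def __init__(self, estimator=None, n_clusters=8):
>            self.estimator = estimator
>            self.n_clusters = n_clusters
>
>        def fit_predict(self, X, y=None):
>            base = self.estimator
>            if base is None:
>                base = KMeans(n_clusters=2)
>            labels = np.zeros(X.shape[0], dtype=int)
>            for new_label in range(1, self.n_clusters):
>                # naive default: always split the currently largest cluster
>                target = np.bincount(labels).argmax()
>                idx = np.where(labels == target)[0]
>                sub = clone(base).fit_predict(X[idx])
>                # members assigned to the second sub-cluster get a fresh label
>                labels[idx[sub == 1]] = new_label
>            self.labels_ = labels
>            return labels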
>
>
>Joel,
>
>I'm working on getting a set of tests/examples built for this
>algorithm, and then I'll submit the PR. In terms of deciding which
>cluster to split next, the commonly used methods are to split the
>largest cluster or, for centroid-based methods, to split the cluster
>with the largest variance from the center. In line with the
>meta-clustering idea, users will be able to supply criteria of their
>own, but the algorithm should choose sensible defaults whenever
>possible.
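>
>For example, those two defaults could be as simple as the following
>(just a sketch; the function names are made up):
>
>    import numpy as np
>
>    def largest_cluster(X, labels):
>        # split the cluster with the most members
>        return np.bincount(labels).argmax()
>
>    def largest_within_cluster_sse(X, labels):
>        # split the cluster with the largest sum of squared
>        # distances to its own centroid
>        sse = {k: ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
>               for k in np.unique(labels)}
>        return max(sse, key=sse.get)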
>
>Best,
>Sam
>
>On Mon, May 18, 2015 at 10:34 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>
>> This feels a bit like it should be a meta-estimator using an
>> arbitrary clustering algorithm to create a divisive one.
>> That would easily allow the PCA thing.
>>
>>
>>
>> On 05/17/2015 07:44 PM, Joel Nothman wrote:
>>
>> Hi Sam,
>>
>> I think this could be interesting. You could allow for learning
>> parameters on each sub-cluster by accepting a transformer as a
>> parameter, then using
>> sample = sklearn.base.clone(transformer).fit_transform(sample).
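>>
>> In other words (rough sketch only, the function name is
>> illustrative), each node would re-learn its own representation
>> before being split:
>>
>>     from sklearn.base import clone
>>
>>     def split_node(sample, transformer, clusterer):
>>         # re-fit the transformer on just this node's samples
>>         reduced = clone(transformer).fit_transform(sample)
>>         return clone(clusterer).fit_predict(reduced)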
>>
>> I suspect bisecting k-means is notable enough and different enough
>> for inclusion. Seeing as you have an implementation, I would suggest
>> constructing a small example (preferably on real-world data) that
>> highlights the superiority or distinctiveness of this approach. Once
>> you have something illustrative, submit a PR with the output and see
>> how people review it.
>>
>> In terms of establishing a fixed algorithm, is the criterion for
>> which cluster is next expanded standard in the literature? Are there
>> alternatives? (I've not read up yet.)
>>
>>  Thanks,
>>
>>  Joel
>>
>> On 17 May 2015 at 06:43, Sam Schetterer <samsc...@gmail.com> wrote:
>>
>>> Andreas,
>>>
>>> There isn't necessarily a linkage function defined, at least in the
>>> sense of agglomerative clustering, since this is not comparing
>>> clusters to merge but rather breaking them up. The clusters are
>>> split using another clustering algorithm supplied by the caller. The
>>> most common one that I've found in the literature is kmeans with 2
>>> clusters, which leads to a binary tree structure and is generally
>>> referred to as bisecting kmeans (used, for example, in the first
>>> citation). One could use any clustering algorithm, or even two
>>> different ones that are used in different conditions (spectral
>>> clustering when n < 1000 and kmeans otherwise, for example).
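>>>
>>> Such a condition could be passed in as a plain callable, something
>>> along these lines (illustrative only, not the actual interface):
>>>
>>>     from sklearn.cluster import KMeans, SpectralClustering
>>>
>>>     def split(sample):
>>>         # spectral clustering for small nodes, kmeans otherwise
>>>         if sample.shape[0] < 1000:
>>>             algo = SpectralClustering(n_clusters=2)
>>>         else:
>>>             algo = KMeans(n_clusters=2)
>>>         return algo.fit_predict(sample)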
>>>
>>> In addition, with divisive clustering, one can refine the distance
>>> metric for various tree branches, which I don't think is possible
>>> with agglomerative clustering. I've done this with text clustering
>>> to get more accurate tf-idf deeper in the hierarchy, and the second
>>> paper I cited in the original email performs the SVD at each new
>>> level.
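>>>
>>> Concretely, the per-branch refit could look roughly like this (just
>>> a sketch; n_components and the function name are arbitrary):
>>>
>>>     from sklearn.cluster import KMeans
>>>     from sklearn.decomposition import TruncatedSVD
>>>     from sklearn.feature_extraction.text import TfidfVectorizer
>>>
>>>     def split_branch(docs):
>>>         # re-learn idf weights and the SVD basis on this branch only
>>>         tfidf = TfidfVectorizer().fit_transform(docs)
>>>         reduced = TruncatedSVD(n_components=50).fit_transform(tfidf)
>>>         return KMeans(n_clusters=2).fit_predict(reduced)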
>>>
>>> You bring up a good point about divisive vs. agglomerative being an
>>> implementation detail, although I think for certain uses it may be
>>> very important. If it's expensive to compute a connectivity matrix,
>>> bisecting kmeans will perform significantly better than the
>>> agglomerative methods on larger datasets.
>>>
>>>  Best,
>>> Sam
>>>