Hi, Michael, thank you for your comments and ideas.

I was probably not clear enough about the complexity of implementing these
algorithms, because I don't think it's a substantial amount of work. Excuse
me if this was already understood, but let me add that consensus clustering
is not limited to graphs. It's about combining multiple clustering
solutions into a single consolidated partition. Using graphs and graph
partitioning is just one way to do it. I was just describing some points of a
paper (a very popular one, though) that proposes three graph-based
consensus functions. But there are many other algorithms for combining
partitions.
For example, I could implement an "evidence accumulation" approach
(proposed in this paper
<http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1432715>)
using just a hierarchical clustering algorithm (already provided by
scikit-learn) and a set of diverse partitions generated with k-means,
MeanShift and DBSCAN (all provided by sklearn as well) by varying their
parameters (the number of clusters for k-means, eps or min_samples for
DBSCAN, etc.).
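
As a rough sketch of the ensemble-generation step only (the helper name
generate_ensemble, the toy data and the parameter grids are illustrative
assumptions, not something taken from the paper), the diverse partitions
could be produced like this:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, MeanShift
from sklearn.datasets import make_blobs


def generate_ensemble(X, ks=(2, 3, 4, 5), eps_values=(0.5, 1.0), random_state=0):
    """Return a list of label arrays, one per base clustering.

    Hypothetical helper: the parameter grids are arbitrary choices
    made only for illustration.
    """
    ensemble = []
    # k-means with a varying number of clusters
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        ensemble.append(km.fit_predict(X))
    # DBSCAN with a varying neighbourhood radius (label -1 marks noise)
    for eps in eps_values:
        ensemble.append(DBSCAN(eps=eps, min_samples=5).fit_predict(X))
    # MeanShift with its default (estimated) bandwidth
    ensemble.append(MeanShift().fit_predict(X))
    return ensemble


# Toy data, assumed only for this example
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
ensemble = generate_ensemble(X)
print(len(ensemble), "partitions of", X.shape[0], "points")
```

Each member of the ensemble is just a vector of cluster labels, which is
all a consensus function needs as input.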
Although I have to take a deeper look at igraph, I think sklearn is better
suited to this kind of algorithm. It provides all the necessary components
to implement consensus clustering: methods for ensemble generation (as it
provides many clustering algorithms as well as methods for sampling data
points, feature selection and data projection) and methods for ensemble
combination (one type of consensus function can be implemented with
hierarchical clustering; it works by building a similarity matrix of the data
from the input partitions and then clustering over it).
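
A minimal sketch of that kind of consensus function follows, under some
assumptions: the function names and the toy ensemble are hypothetical, the
label -1 is treated as DBSCAN-style noise, and SciPy's average-linkage
clustering is used here in place of scikit-learn's hierarchical clustering
simply to keep the example short. The similarity between two points is the
fraction of partitions that put them in the same cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def coassociation_matrix(ensemble):
    """Fraction of partitions in which each pair of points shares a cluster."""
    ensemble = np.asarray(ensemble)
    n_partitions, n_samples = ensemble.shape
    coassoc = np.zeros((n_samples, n_samples))
    for labels in ensemble:
        same = labels[:, None] == labels[None, :]
        # Assumption: label -1 means "noise", so it never counts as a
        # shared cluster.
        noise = labels == -1
        same[noise, :] = False
        same[:, noise] = False
        coassoc += same
    return coassoc / n_partitions


def consensus_partition(ensemble, n_clusters):
    """Cluster the co-association matrix with average-linkage clustering."""
    dist = 1.0 - coassociation_matrix(ensemble)
    np.fill_diagonal(dist, 0.0)  # a point is at distance 0 from itself
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Toy ensemble: three partitions of six points (made up for illustration)
toy_ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
labels = consensus_partition(toy_ensemble, n_clusters=2)
print(labels)
```

Note that only the label vectors are used; the original data never enters
the consensus step.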
In many respects, an API for these methods would be analogous to that of
the ensemble classifiers currently implemented in sklearn. In fact, cluster
ensembles were inspired mostly by classifier ensembles.

Regards,
Milton.
2015-02-13 13:05 GMT-03:00 Michael Bommarito <mich...@bommaritollc.com>:
> Milton, my opinion is that the best work available in Python for
> clustering and community detection has been done in the igraph project (
> http://igraph.org/). While I would personally love to see better support
> for these un- and semi-supervised tasks in sklearn, it is a substantial
> investment of time and LOC. If I were you, I would reach out to Gabor or
> Tamas to see if they would accept such a PR there in igraph; I would be
> happy to introduce you if you'd like.
>
> Thanks,
> Michael J. Bommarito II, CEO
> Bommarito Consulting, LLC
> *Web:* http://www.bommaritollc.com
> *Mobile:* +1 (646) 450-3387
>
> On Fri, Feb 13, 2015 at 10:08 AM, Ronnie Ghose <ronnie.gh...@gmail.com>
> wrote:
>
>> -1 we would have to build in support for more clustering methods, sounds
>> like a not-very-standalone project
>>
>> On Fri, Feb 13, 2015 at 10:02 AM, Milton Pividori <milto...@gmail.com>
>> wrote:
>>
>>> Hi, Andy. Thank you for the interest.
>>>
>>> Consensus clustering is usually used in the same context as traditional
>>> clustering techniques. Many papers have reported significant accuracy
>>> improvements when using these methods, as they can combine partitions from
>>> several different algorithms, finding interesting structures that are
>>> usually not discovered by traditional methods. They are similar to
>>> ensemble methods in the supervised world, although they have their own
>>> particularities, of course.
>>>
>>> One of the motivations for these methods is to spare the inexperienced
>>> user the choice of a single clustering algorithm: such users usually face
>>> many different alternatives for their problem, and the choice is generally
>>> not easy. Consensus clustering tries to mitigate this by running
>>> several clustering methods with different parameters (like the number of
>>> clusters). This set of partitions is called an ensemble, and it is the
>>> input of the consensus function, which derives from it a single consensus
>>> partition that usually outperforms all the individual members of the
>>> input set. The JMLR paper
>>> <http://www.jmlr.org/papers/volume3/strehl02a/strehl02a.pdf> I
>>> mentioned before proposes a framework for this, called Robust Centralized
>>> Clustering (RCC).
>>>
>>> Other interesting applications of these methods, as mentioned in the
>>> previous paper, are Feature-Distributed Clustering (FDC) and
>>> Object-Distributed Clustering (ODC). The first one, FDC, allows the user to
>>> combine partitions generated from partial views of the data. A common
>>> scenario is distributed databases, which usually cannot be integrated at
>>> a centralized location for various reasons (proprietary data,
>>> privacy concerns, performance issues, etc.). In such scenarios, it is more
>>> realistic to have different "clusterers" at those different places, and
>>> then combine only the clustering results at a central location. This is
>>> possible because the consensus function only needs access to the cluster
>>> labels produced by those clusterers (traditional methods), not to the
>>> whole data. The other application, ODC, is similar but with distributed
>>> objects instead of distributed features, and it has its own challenges.
>>> An example is the customer database of a company distributed across
>>> different cities. One of the issues here, for instance, is that the
>>> consensus function needs some overlap among the sets of objects held at
>>> the different locations.
>>>
>>> Well, this is a short description of these methods. Let me know if you
>>> need more details.
>>>
>>> Regards,
>>>
>>> Milton
>>>
>>> 2015-02-12 18:47 GMT-03:00 Andy <t3k...@gmail.com>:
>>>
>>>> Hi Milton.
>>>>
>>>> In which context is consensus clustering usually used, and what are the
>>>> main applications?
>>>> We will not add an external dependency, sorry.
>>>>
>>>> Cheers,
>>>> Andy
>>>>
>>>>
>>>>
>>>> On 02/12/2015 01:55 PM, Milton Pividori wrote:
>>>>
>>>> Hi, guys. My name is Milton Pividori and this is the first time I am
>>>> writing to this list. I'm a PhD student, working on clustering,
>>>> particularly on consensus clustering. I'm relatively new to Python, and
>>>> I am migrating legacy code from MATLAB. I plan to use scikit-learn as
>>>> well as other libraries.
>>>>
>>>> After looking at the scikit-learn code and the mailing list, I didn't find
>>>> any methods related to consensus clustering or cluster ensembles. I think
>>>> the main paper about it is the one from Strehl and Ghosh (2002, JMLR,
>>>> link <http://www.jmlr.org/papers/volume3/strehl02a/strehl02a.pdf>). I
>>>> don't know if you have discussed it before, but I think it could be a good
>>>> idea to have these consensus functions implemented in scikit-learn (the
>>>> paper proposes three, all graph-based).
>>>>
>>>> I was thinking about how to implement them. These three consensus
>>>> functions (CSPA, HGPA and MCLA) rely on graph partitioning with METIS
>>>> (HGPA uses its hypergraph variant, hMETIS). That could be an obstacle
>>>> for scikit-learn, as a new dependency would be needed (I found Python
>>>> bindings for it). It would also be necessary to
>>>> implement some methods for ensemble generation with varying levels of
>>>> diversity (generating different clustering partitions by varying
>>>> algorithms, changing their parameters or manipulating data with
>>>> projections, subsampling or feature selection), but that's easier than
>>>> implementing the consensus functions.
>>>>
>>>> Well, it's just an idea. I would be glad to help with coding if this
>>>> is interesting for the community.
>>>>
>>>> Regards,
>>>>
>>>> 2015-02-12 13:38 GMT-03:00 Sebastian Raschka <se.rasc...@gmail.com>:
>>>>
>>>>> What about adding multiclass support for the SVC "roc_auc" for grid
>>>>> search CV to the to do list?
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On Feb 12, 2015, at 10:12 AM, Ronnie Ghose <ronnie.gh...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> +1 to partial fit, -1 to GAM and more probabilistic things in sklearn
>>>>>
>>>>> On Thu, Feb 12, 2015, 9:22 AM ragv ragv <rag...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is there a good deal of interest in having GAMs implemented?
>>>>>>
>>>>>> The timeline for such a project would go something like :
>>>>>>
>>>>>> Before GSoC:
>>>>>> * Implement SpAM
>>>>>>
>>>>>> Before Midterm :
>>>>>> * Help merge pyearth into scikit learn
>>>>>> * Implement Additive Model -> `AdditiveClassifier` /
>>>>>> `AdditiveRegressor` ( Not sure if my wording here is correct )
>>>>>>
>>>>>> After Midterm :
>>>>>> * Implement GAMLSS
>>>>>> * Implement LISO
>>>>>>
>>>>>> Kindly also see
>>>>>> https://github.com/scikit-learn/scikit-learn/issues/3482 for
>>>>>> references with citation counts.
>>>>>>
>>>>>> The package mgcv by Simon Wood (GAM / BAM) on CRAN is mature and
>>>>>> could be used as reference material too...
>>>>>>
>>>>>> On a scale of 0 to 100, could I know how much importance / interest
>>>>>> there would be in such a project for GSoC 2015?
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Dive into the World of Parallel Programming. The Go Parallel Website,
>>>>>> sponsored by Intel and developed in partnership with Slashdot Media,
>>>>>> is your
>>>>>> hub for all things parallel software development, from weekly thought
>>>>>> leadership blogs to news, videos, case studies, tutorials and more.
>>>>>> Take a
>>>>>> look and join the conversation now.
>>>>>> http://goparallel.sourceforge.net/
>>>>>> _______________________________________________
>>>>>> Scikit-learn-general mailing list
>>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>
>>>> --
>>>> Milton Pividori
>>>> Blog: www.miltonpividori.com.ar
>>>>
>>>
>>> --
>>> Milton Pividori
>>> Blog: www.miltonpividori.com.ar
>>>
--
Milton Pividori
Blog: www.miltonpividori.com.ar