Hello Kev,
I am quite familiar with W. Bialek's work on mutual information, as during my
previous PhD I was working on closed-loop applications for unsupervised
learning controllers.
I will be quite happy to beta test your code!
Cheers.
From: kev devnull [mailto:[email protected]]
Sent: 22 November 2013 00:27
To: [email protected]
Subject: [Scikit-learn-general] Adding a flexible mutual
information/information theory based clustering method to sklearn.cluster?
Hi all,
I'm currently developing a Python/C application related to a population
genetics / evolution-based simulation with populations of discrete dynamical
systems (...). I am using scipy/numpy/scikit-learn/matplotlib for development,
and in the course of writing the code I've been working on a Python
implementation of "Information Based Clustering" (Slonim et al.:
http://www.pnas.org/content/102/51/18297.abstract), including mutual
information estimation (http://xxx.lanl.gov/abs/cs.IT/0502017).
The clustering algorithm has several interesting features, including the
ability to swap in various "similarity/difference" matrices (these can be
information-theoretic measures of similarity, e.g. a rate-distortion matrix or
a matrix of mutual information values, but one may use whatever difference
measure is most appropriate to the data/application). I am implementing both
the clustering method from the first paper and the estimation of mutual
information from the second.
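To give a concrete sense of the kind of similarity matrix the scheme accepts,
here is a minimal sketch of a pairwise mutual information matrix over
discretized rows. This is only an illustration, not the estimator from the
second paper: the helper name `pairwise_mi`, the equal-width binning, and the
bin count are my own choices, and `sklearn.metrics.mutual_info_score` is a
simple plug-in MI estimate.

```python
import numpy as np
from sklearn.metrics import mutual_info_score  # MI between two label sequences

def pairwise_mi(X, n_bins=8):
    """Toy pairwise MI similarity matrix (equal-width binning per row)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    labels = np.empty(X.shape, dtype=int)
    for i in range(n):
        # discretize row i into n_bins equal-width bins
        edges = np.linspace(X[i].min(), X[i].max(), n_bins + 1)
        labels[i] = np.digitize(X[i], edges[1:-1])
    S = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            # MI is symmetric; the diagonal is each row's entropy
            S[i, j] = S[j, i] = mutual_info_score(labels[i], labels[j])
    return S
```

Any such matrix (MI, rate distortion, or an ordinary distance) could then be
handed to the clustering routine in place of another.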
Much of this work came out of W. Bialek's lab, where these ideas were
originally developed for comparing neural spike train time series (he is one
of the authors of the popular computational neuroscience book "Spikes"). I've
used a C++ implementation that I previously wrote for segmenting genomic
time series with good results (just using Euclidean distance and Pearson
correlation, without even delving into the MI-based similarity measurements
covered in the second paper above).
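For illustration, the two plain measures mentioned above can each be formed
as a matrix in a couple of lines; variable names and the toy data are mine,
the calls are standard scipy/numpy.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 100))  # toy stand-in: rows = genomic windows

# pairwise Euclidean distance matrix (a "difference" matrix)
D_euclid = squareform(pdist(X, metric="euclidean"))

# pairwise Pearson correlation matrix (a "similarity" matrix in [-1, 1])
S_pearson = np.corrcoef(X)
```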
In any case, I was wondering if the scikit-learn team might like an
implementation of this flexible clustering scheme, which is fairly popular in
the gene regulatory network community and has features that, to my knowledge,
no other clustering algorithm offers (e.g. if two members of the dataset share
more than a single bit of mutual information, then their relationship is more
complicated than simply switching one another off). I'd enjoy formatting the
Python to the standard scikit-learn code style so that it fits well with the
existing clustering code. I would also like to contribute additional
unsupervised learning algorithms if people would like contributors in this
area.
Please let me know if the team is interested and I will get the IBC code in a
shape that is ready for submission to the project.
Thank you for your time!
-kc
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general