Hey,

My understanding is that with scikit-learn you can estimate the mutual information 
between 2 continuous variables like this:

mutual_info_regression(data["var1"].to_frame(), data["var"], 
discrete_features=[False])

Where var1 and var are continuous.
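
For example, here is a minimal self-contained sketch (the column names and the 
synthetic data are invented just for illustration):

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)

# two correlated continuous variables
x = rng.standard_normal(500)
data = pd.DataFrame({
    "var1": x + 0.5 * rng.standard_normal(500),
    "var": x + 0.5 * rng.standard_normal(500),
})

# X must be 2-D, hence to_frame(); discrete_features=[False] marks var1 as continuous
mi = mutual_info_regression(
    data["var1"].to_frame(), data["var"],
    discrete_features=[False], random_state=0,
)
print(mi)  # array with one MI estimate, in nats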

You can also compute the mutual information of several continuous features with 
one continuous target like this:

mutual_info_regression(data[["var1", "var_2", "var_3"]], data["var"],
discrete_features=[False, False, False])
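
Again just a sketch, assuming var_2 and var_3 are additional continuous columns 
(here var_2 is pure noise and var_3 feeds into the target, so the estimates differ):

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
n = 500

data = pd.DataFrame({
    "var1": rng.standard_normal(n),
    "var_2": rng.standard_normal(n),   # unrelated to the target
    "var_3": rng.standard_normal(n),
})
data["var"] = data["var1"] + 0.3 * data["var_3"] + 0.5 * rng.standard_normal(n)

mi = mutual_info_regression(
    data[["var1", "var_2", "var_3"]], data["var"],
    discrete_features=[False, False, False],
    random_state=0,
)
print(dict(zip(["var1", "var_2", "var_3"], mi.round(3))))  # one estimate per feature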

Under the hood, scikit-learn uses a nonparametric method based on entropy 
estimation from k-nearest neighbor distances, i.e. the nearest-neighbor approach 
to estimating MI taken from Ross, 2014, PLoS ONE 9(2): e87357.
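
A quick way to sanity-check the estimator: for a bivariate Gaussian with 
correlation rho, the true MI is -0.5 * ln(1 - rho^2), so you can compare the 
kNN estimate against that closed form (a rough sketch, the numbers will vary a 
little with the sample):

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
rho = 0.6

# bivariate Gaussian sample with correlation rho
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=5000)
x, y = xy[:, 0], xy[:, 1]

# kNN-based estimate (in nats); n_neighbors defaults to 3
mi_knn = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

# analytic value for the bivariate Gaussian
mi_true = -0.5 * np.log(1 - rho**2)

print(mi_knn, mi_true)  # the two should be reasonably close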

More details here: 
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html

And I've got a blog post about Mutual info with Python here: 
https://www.blog.trainindata.com/mutual-information-with-python/

Cheers
Sole

Soledad Galli
https://www.trainindata.com/


------- Original Message -------
On Wednesday, February 1st, 2023 at 10:32 AM, m m <mhfh.k...@gmail.com> wrote:

> Hello,
>
> I have two continuous variables (heart rate samples over a period of time), 
> and would like to compute mutual information between them as a measure of 
> similarity.
>
> I've read some posts suggesting the mutual_info_score from scikit-learn, but 
> will this work for continuous variables? One Stack Overflow answer suggested 
> converting the data into probabilities with np.histogram2d() and passing the 
> contingency table to mutual_info_score.
>
> import numpy as np
> from sklearn.metrics import mutual_info_score
>
> def calc_MI(x, y, bins):
>     # bin the data and use the joint counts as a contingency table
>     c_xy = np.histogram2d(x, y, bins)[0]
>     mi = mutual_info_score(None, None, contingency=c_xy)
>     return mi
>
> # generate two correlated standard-normal variables
> L = np.linalg.cholesky([[1.0, 0.60], [0.60, 1.0]])
> uncorrelated = np.random.standard_normal((2, 300))
> correlated = np.dot(L, uncorrelated)
> A = correlated[0]
> B = correlated[1]
> x = (A - np.mean(A)) / np.std(A)
> y = (B - np.mean(B)) / np.std(B)
>
> # calculate MI with 50 bins per axis
> mi = calc_MI(x, y, 50)
>
> Is calc_MI a valid approach? I'm asking because I also read that when the 
> variables are continuous, the sums in the formula for discrete data become 
> integrals, and I'm not sure whether that is what scikit-learn implements.
>
> Thanks!