On Thu, Oct 13, 2011 at 11:29 PM, Robert Layton <[email protected]> wrote:
> That makes sense. I'll add an optional eps value, and handle the case of 0
> when it comes up.
> Thanks,
> Robert
>
> On 14 October 2011 14:23, Skipper Seabold <[email protected]> wrote:
>>
>> On Thu, Oct 13, 2011 at 11:10 PM, Robert Layton <[email protected]>
>> wrote:
>> > I'm working on adding Adjusted Mutual Information, and need to calculate
>> > the Mutual Information.
>> > I think I have the algorithm itself correct, except that whenever an
>> > entry of the contingency matrix is 0, a nan appears and propagates
>> > through the code.
>> >
>>
>> FWIW, scipy.stats defines the entropy contribution of p(x) = 0 to be 0,
>> and I think that is so by definition. The other option I've seen in
>> software is to let the user define the eps.

I assume the case where this shows up is always 0*log(0), i.e. x*log(x) with
x=0. This problem shows up quite often in stats, and I'm hoping to
eventually get a numpy or scipy.special solution.
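
To make the convention concrete, here is a minimal sketch of an x*log(x) helper that returns 0 at x=0 by masking rather than by adding an eps. The name `xlogx` is made up for illustration, and the helper assumes a non-negative array input:

```python
import numpy as np

def xlogx(x):
    """Return x * log(x) elementwise, treating 0 * log(0) as 0.

    Assumes x is array-like with non-negative entries.
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    pos = x > 0
    # Only evaluate the log where it is finite; zeros keep the value 0.
    out[pos] = x[pos] * np.log(x[pos])
    return out

p = np.array([0.5, 0.5, 0.0])
entropy = -xlogx(p).sum()  # no nan, the zero entry contributes nothing
```

For reference, later SciPy releases provide scipy.special.xlogy(x, y), which computes x*log(y) and returns 0 where x is 0.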

What I have done in several cases, to avoid any condition checking, is to
add a tiny number inside the log (much smaller than 1e-16):
>>> 0*np.log(0+1e-200)
-0.0

Not sure whether it's ever relevant, but for very small x the size of the
added constant changes the result:

>>> x=1e-30
>>> x*np.log(x+1e-200)
-6.9077552789821378e-29
>>> x*np.log(x)
-6.9077552789821378e-29
>>> x*np.log(x+np.finfo(float).eps)
-3.6043653389117159e-29
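
For the mutual information code quoted further down, one way to apply the 0*log(0) = 0 convention without any eps is to sum over the nonzero cells only. This is a rough sketch, not scikit-learn's implementation: the name `mi_from_contingency` is hypothetical, and it assumes the contingency matrix is a 2-D array of counts with at least one nonzero entry:

```python
import numpy as np

def mi_from_contingency(contingency):
    """Mutual information (in nats) from a 2-D array of co-occurrence counts."""
    contingency = np.asarray(contingency, dtype=float)
    pij = contingency / contingency.sum()   # joint distribution P(i, j)
    pi = pij.sum(axis=1)                    # marginal P(i)
    pj = pij.sum(axis=0)                    # marginal P'(j)
    outer = np.outer(pi, pj)
    # Empty cells contribute 0 by convention, so skip them entirely.
    nz = pij > 0
    return float(np.sum(pij[nz] * (np.log(pij[nz]) - np.log(outer[nz]))))

identical = mi_from_contingency([[2, 0], [0, 2]])    # perfectly matching labels
independent = mi_from_contingency([[1, 1], [1, 1]])  # unrelated labels
```

Wherever a cell is nonzero its row and column marginals are nonzero too, so neither log ever sees a 0 and no nan can appear.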

josef


>>
>>
>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L5284
>>
>> > Sample code on the net [1] uses an eps=np.finfo(float).eps. Should I do
>> > this, adding eps to anything that is a denominator or parameter to log?
>> > Is there a better way?
>> >
>> > [1] http://blog.sun.tc/2010/10/mutual-informationmi-and-normalized-mutual-informationnmi-for-numpy.html
>> > FYI: My current code:
>> > def mutual_information(labels_true, labels_pred, contingency=None):
>> >     if contingency is None:
>> >         labels_true, labels_pred = check_clusterings(labels_true,
>> >                                                      labels_pred)
>> >         contingency = contingency_matrix(labels_true, labels_pred)
>> >     # Joint distribution P(i, j), plus marginals P(i) and P'(j)
>> >     pij = contingency / float(np.sum(contingency))
>> >     pi = np.sum(pij, axis=1)
>> >     pj = np.sum(pij, axis=0)
>> >     # Compute log for all values (-inf wherever a cell is 0)
>> >     log_pij = np.log(pij)
>> >     # Product of pi and pj for denominator
>> >     pi_pj = np.outer(pi, pj)
>> >     # Remembering that log(x/y) = log(x) - log(y)
>> >     mi = np.sum(pij * (log_pij - np.log(pi_pj)))
>> >     return mi
>> > --
>> >
>> >
>> > My public key can be found at: http://pgp.mit.edu/
>> > Search for this email address and select the key from "2011-08-19" (key
>> > id:
>> > 54BA8735)
>> > Older keys can be used, but please inform me beforehand (and update when
>> > possible!)
>> >
>> >
>> >
>> > ------------------------------------------------------------------------------
>> > All the data continuously generated in your IT infrastructure contains a
>> > definitive record of customers, application performance, security
>> > threats, fraudulent activity and more. Splunk takes this data and makes
>> > sense of it. Business sense. IT sense. Common sense.
>> > http://p.sf.net/sfu/splunk-d2d-oct
>> > _______________________________________________
>> > Scikit-learn-general mailing list
>> > [email protected]
>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >
>> >
>>
>>
>
>
>
>
>
>
>

