On Thu, Oct 13, 2011 at 11:29 PM, Robert Layton <[email protected]> wrote:
> That makes sense. I'll add an optional eps value, and handle the case of 0
> when it comes up.
> Thanks,
> Robert
>
> On 14 October 2011 14:23, Skipper Seabold <[email protected]> wrote:
>>
>> On Thu, Oct 13, 2011 at 11:10 PM, Robert Layton <[email protected]>
>> wrote:
>> > I'm working on adding Adjusted Mutual Information, and need to calculate
>> > the Mutual Information.
>> > I think I have the algorithm itself correct, except for the fact that
>> > whenever the contingency matrix is 0, a nan happens and propagates
>> > through the code.
>> >
>>
>> FWIW, scipy.stats defines entropy of p(x) = 0 to be 0, and I think it
>> is so by definition. The other option I've seen in software is to let
>> the user define the eps.

I assume the case where this shows up is always 0*log(0) (x*log(x) with
x=0). This problem shows up quite often in stats, and I'm hoping to
eventually get a numpy or scipy.special solution.

What I did in several cases, to avoid any condition checking, is to add a
tiny number (not 1e-16):

>>> 0*np.log(0+1e-200)
-0

Not sure whether it's ever relevant, but:

>>> x=1e-30
>>> x*np.log(x+1e-200)
-6.9077552789821378e-29
>>> x*np.log(x)
-6.9077552789821378e-29
>>> x*np.log(x+np.finfo(float).eps)
-3.6043653389117159e-29

josef

>>
>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L5284
>>
>> > Sample code on the net [1] uses an eps=np.finfo(float).eps. Should I do
>> > this, adding eps to anything that is a denominator or parameter to log?
>> > Is there a better way?
>> >
>> > [1] http://blog.sun.tc/2010/10/mutual-informationmi-and-normalized-mutual-informationnmi-for-numpy.html
>> > FYI: My current code:
>> > def mutual_information(labels_true, labels_pred, contingency=None):
>> >     if contingency is None:
>> >         labels_true, labels_pred = check_clusterings(labels_true,
>> >                                                      labels_pred)
>> >         contingency = contingency_matrix(labels_true, labels_pred)
>> >     # Calculate P(i) for all i and P'(j) for all j
>> >     pi = np.sum(contingency, axis=1)
>> >     pi /= float(np.sum(pi))
>> >     pj = np.sum(contingency, axis=0)
>> >     pj /= float(np.sum(pj))
>> >     # Compute log for all values
>> >     log_pij = np.log(contingency)
>> >     # Product of pi and pj for denominator
>> >     pi_pj = np.outer(pi, pj)
>> >     # Remembering that log(x/y) = log(x) - log(y)
>> >     mi = np.sum(contingency * (log_pij - pi_pj))
>> >     return mi
>> > --
>> >
>> > My public key can be found at: http://pgp.mit.edu/
>> > Search for this email address and select the key from "2011-08-19" (key
>> > id: 54BA8735)
>> > Older keys can be used, but please inform me beforehand (and update when
>> > possible!)
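
For comparison with the tiny-number trick above, here is a minimal sketch of the other route, treating 0*log(0) as 0 through explicit masking. The names xlogx and mutual_information_from_contingency are made up for illustration and are not part of numpy, scipy, or scikit-learn. The sketch also assumes the contingency table holds raw counts, normalizes it to joint probabilities first, and subtracts log(pi*pj) rather than pi*pj, following the log(x/y) = log(x) - log(y) comment in the quoted code.

```python
import numpy as np


def xlogx(x):
    """Return x * log(x) elementwise, taking 0 * log(0) to be 0.

    Hypothetical helper for illustration; assumes x >= 0.
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x > 0                       # only evaluate log where it is finite
    out[nz] = x[nz] * np.log(x[nz])
    return out


def mutual_information_from_contingency(contingency):
    """Mutual information in nats from a table of co-occurrence counts.

    Hypothetical sketch; empty cells contribute 0 and are skipped.
    """
    contingency = np.asarray(contingency, dtype=float)
    pij = contingency / contingency.sum()   # joint probabilities
    pi = pij.sum(axis=1)                    # row marginals
    pj = pij.sum(axis=0)                    # column marginals
    pi_pj = np.outer(pi, pj)
    nz = pij > 0                            # pij > 0 implies pi_pj > 0
    return np.sum(pij[nz] * (np.log(pij[nz]) - np.log(pi_pj[nz])))
```

With masking, no eps has to be chosen and no nan is produced for empty cells. The cost is the condition check that the tiny-number approach avoids.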

------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure contains a
definitive record of customers, application performance, security
threats, fraudulent activity and more. Splunk takes this data and makes
sense of it. Business sense. IT sense. Common sense.
http://p.sf.net/sfu/splunk-d2d-oct
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
