Thank you for the suggestions. Unfortunately, the behavior persists after trying them :-(. To answer Dale's question: when I pass the array to np.random.choice, I get a ValueError that the probabilities do not sum to 1.
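To make the error concrete, here is a minimal sketch of the failure mode, plus the renormalize-in-float64 workaround I am experimenting with (random filler data, not my real matrix, and I am not claiming this is the right fix):

import numpy as np

m = np.random.rand(1000, 50).astype(np.float32)
m /= m.sum(axis=1, keepdims=True)   # L1-normalize each row in float32

# np.random.choice converts p to float64 and checks the sum against a tight
# tolerance, so a row that "sums to 1" in float32 can still raise
# "ValueError: probabilities do not sum to 1".
p = m[0].astype(np.float64)
p /= p.sum()                        # renormalize in double precision
idx = np.random.choice(len(p), p=p)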
I found the line of code that seems to cause the problem: a call to numpy.power(...). I have to raise each element of the matrix to a certain power. Once I do this, normalizing the rows of the matrix no longer always yields a row sum of 1. *Without* this line, the rows *always* sum to 1.

To check, I removed the call to np.power and tested with both sklearn's normalize function and with np.apply_along_axis(lambda x: x / np.sum(x), 1, my_matrix); both work. With the call to np.power, neither method returns rows that sum to exactly 1. I suppose at this point this is more of a numpy question than a scikit-learn question.
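For anyone who wants to poke at this, here is a self-contained sketch of the effect (made-up matrix and exponent, so the exact numbers will differ from mine):

import numpy as np

rng = np.random.RandomState(0)
m = rng.rand(1000, 50).astype(np.float32)

m = np.power(m, 3.7)                # element-wise power; result stays float32
m /= m.sum(axis=1, keepdims=True)   # L1-normalize rows (same as norm='l1' for nonnegative data)

sums = m.sum(axis=1)
# Row sums frequently land on values like 0.99999994 or 1.00000012 instead
# of exactly 1.0 -- ordinary float32 rounding, not a bug in sklearn's
# normalize.
print(np.count_nonzero(sums != 1.0), "rows do not sum to exactly 1.0")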
— Ryan

> On Dec 17, 2015, at 5:09 AM, Dale Smith <dsm...@nexidia.com> wrote:
>
> Ryan, did you try passing the arrays, as they are, to np.random.choice? Do
> you get what you expect?
>
> Dale Smith, Ph.D.
> Data Scientist
>
> -----Original Message-----
> From: Matthieu Brucher [mailto:matthieu.bruc...@gmail.com]
> Sent: Thursday, December 17, 2015 7:56 AM
> To: scikit-learn-general@lists.sourceforge.net
> Subject: Re: [Scikit-learn-general] sklearn.preprocessing.normalize does not sum to 1
>
> The thing is that even if you did sum and divide by the sum, summing the
> results back may not lead to 1.0. This is always the "issue" in floating
> point computation.
>
> Cheers,
>
> Matthieu
>
> 2015-12-17 8:26 GMT+01:00 Ryan R. Rosario <r...@bytemining.com>:
>> Hi,
>>
>> I have a very large dense numpy matrix. To avoid running out of RAM, I use
>> np.float32 as the dtype instead of the default np.float64 on my system.
>>
>> When I do an L1 normalization of the rows (axis=1) of my matrix in place
>> (copy=False), I frequently get rows that do not sum to 1. Since these are
>> probability distributions that I pass to np.random.choice, they must sum
>> to exactly 1.0.
>>
>> pp.normalize(term, norm='l1', axis=1, copy=False)
>> sums = term.sum(axis=1)
>> sums[np.where(sums != 1)]
>>
>> array([ 0.99999994,  0.99999994,  1.00000012, ...,  0.99999994,
>>         0.99999994,  0.99999994], dtype=float32)
>>
>> I wrote some code to manually add/subtract the small difference from 1 for
>> each row, and I made some progress, but still not all the rows sum to 1.
>>
>> Is there a way to avoid this problem?
>>
>> — Ryan
>
> --
> Information System Engineer, Ph.D.
> Blog: http://matt.eifelle.com
> LinkedIn: http://www.linkedin.com/in/matthieubrucher