Thank you for the suggestions. The behavior persists even after trying them :-(.
To answer Dale’s question: when I pass the array to np.random.choice, I get a
ValueError saying that the probabilities do not sum to 1.

I found a line of code that seems to lead to the problem:   

numpy.power(...)

I have to raise each element of the matrix to a certain power. Once I do this, 
normalizing the rows of the matrix does not always yield a row sum of 1. 
*Without* this line, the rows *always* sum to 1. 

I removed the call to np.power and tested this with both sklearn’s normalize 
function and np.apply_along_axis(lambda x: x / np.sum(x), 1, my_matrix); both 
work. With the call to np.power, though, neither method returns rows that sum 
to 1.
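
A minimal sketch of that comparison, for anyone who wants to measure how far
the row sums drift. The random float32 matrix and the exponent of 3.0 are
stand-ins (assumptions); the real matrix and power are not shown in this
thread.

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for the real matrix: 5 rows of 1000 non-negative float32 values.
my_matrix = rng.rand(5, 1000).astype(np.float32)

# Hypothetical exponent; the thread does not say which power is used.
powered = np.power(my_matrix, 3.0)

# L1-normalize each row while staying in float32.
normalized = powered / powered.sum(axis=1, keepdims=True)

# Row sums are computed in float32 too, so they may differ from 1 by a few ulps.
row_sums = normalized.sum(axis=1)
print(row_sums.dtype, row_sums)

# Largest absolute deviation from 1 across all rows.
print(np.abs(row_sums.astype(np.float64) - 1.0).max())

# An exact test (row_sums != 1) can flag rows that a tolerance test accepts.
print(np.allclose(row_sums, 1.0, atol=1e-5))
```

Comparing with a tolerance rather than with != 1 shows whether the rows are
actually wrong or only off by float32 rounding.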

I suppose at this point this is more of a numpy question than a scikit-learn 
question.
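
One possible workaround, offered as a sketch under assumptions rather than a
confirmed fix for this case: keep the big matrix in float32, but cast a single
row to float64 and renormalize it right before sampling, so the sum check in
np.random.choice sees a value much closer to 1. The helper name
sample_from_row and the example row are hypothetical.

```python
import numpy as np


def sample_from_row(row, rng, size=1):
    """Draw column indices using one row as a probability vector."""
    # Cast only this row to float64; the full matrix can stay float32.
    p = np.asarray(row, dtype=np.float64)
    # Renormalize in float64 so the sum is 1 up to float64 rounding.
    p = p / p.sum()
    return rng.choice(p.size, size=size, p=p)


rng = np.random.RandomState(0)
row = np.array([0.2, 0.3, 0.5], dtype=np.float32)  # illustrative row only
draws = sample_from_row(row, rng, size=10)
print(draws)
```

This costs one temporary float64 copy per row sampled instead of doubling the
memory of the whole matrix.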

— Ryan

> On Dec 17, 2015, at 5:09 AM, Dale Smith <dsm...@nexidia.com> wrote:
> 
> Ryan, did you try passing the arrays, as they are, to np.random.choice? Do 
> you get what you expect?
> 
> Dale Smith, Ph.D.
> Data Scientist
> 
> d. 404.495.7220 x 4008   f. 404.795.7221
> Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA 
> 30305
> 
> -----Original Message-----
> From: Matthieu Brucher [mailto:matthieu.bruc...@gmail.com] 
> Sent: Thursday, December 17, 2015 7:56 AM
> To: scikit-learn-general@lists.sourceforge.net
> Subject: Re: [Scikit-learn-general] sklearn.preprocessing.normalize does not 
> sum to 1
> 
> The thing is that even if you did sum and divide by the sum, summing the 
> results back may not lead to 1.0. This is always the "issue" in floating 
> point computation.
> 
> Cheers,
> 
> Matthieu
> 
> 2015-12-17 8:26 GMT+01:00 Ryan R. Rosario <r...@bytemining.com>:
>> Hi,
>> 
>> I have a very large dense numpy matrix. To avoid running out of RAM, I use 
>> np.float32 as the dtype instead of the default np.float64 on my system.
>> 
>> When I do an L1 normalization of the rows (axis=1) in my matrix in-place 
>> (copy=False), I frequently get rows that do not sum to 1. Since these are 
>> probability distributions that I pass to np.random.choice, these must sum to 
>> exactly 1.0.
>> 
>> pp.normalize(term, norm='l1', axis=1, copy=False)
>> sums = term.sum(axis=1)
>> sums[np.where(sums != 1)]
>> 
>> array([ 0.99999994,  0.99999994,  1.00000012, ...,  0.99999994,
>>      0.99999994,  0.99999994], dtype=float32)
>> 
>> I wrote some code to manually add/subtract the small difference from 1 to 
>> each row, and I made some progress, but still not all of the rows sum to 1.
>> 
>> Is there a way to avoid this problem?
>> 
>> — Ryan
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 
> 
> 
> --
> Information System Engineer, Ph.D.
> Blog: http://matt.eifelle.com
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
> 


