Ryan,


Have you tried a small problem to see whether the float32 dtype is causing your 
problems? float64 gives 15-17 significant decimal digits, while float32 gives 
only about 7, so the row sums may not land on an exact 1.0 representation, 
especially with float32.
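
For example, a quick check along these lines (the array size and seed here are 
made up) shows the effect:

import numpy as np
from sklearn.preprocessing import normalize

# Illustrative only: L1-normalize random float32 rows, then count how
# many row sums differ from an exact 1.0.
rng = np.random.RandomState(0)
X = rng.rand(1000, 50).astype(np.float32)
X = normalize(X, norm='l1', axis=1, copy=False)
sums = X.sum(axis=1)
print((sums != 1.0).sum(), "of", len(sums), "rows do not sum to exactly 1.0")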



I am not sure this will help you, but take a look at numpy.memmap. You may be 
able to go back to float64.
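
A minimal sketch, assuming the data can live in a file on disk (the file name 
and shape below are placeholders, not your real data):

import numpy as np

# Hypothetical: back the matrix with a disk file so float64 fits even
# when RAM does not; 'term.dat' and the shape stand in for real data.
term = np.memmap('term.dat', dtype=np.float64, mode='w+',
                 shape=(1000000, 5000))
# ... fill 'term' here; writes go through to the file on disk.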



https://github.com/lmjohns3/theanets/issues/59



After reading this carefully, I have more questions, so perhaps more digging is 
required.



I’d like to suggest that numpy code should not just “blow up” because of these 
types of issues; they are completely foreseeable. Perhaps someone on the numpy 
mailing list could help.




Dale Smith, Ph.D.
Data Scientist

d. 404.495.7220 x 4008   f. 404.795.7221
Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA 30305



-----Original Message-----
From: Ryan R. Rosario [mailto:r...@bytemining.com]
Sent: Thursday, December 17, 2015 2:26 AM
To: Scikit-learn-general@lists.sourceforge.net
Subject: [Scikit-learn-general] sklearn.preprocessing.normalize does not sum to 1



Hi,



I have a very large dense numpy matrix. To avoid running out of RAM, I use 
np.float32 as the dtype instead of the default np.float64 on my system.



When I do an L1 normalization of the rows (axis=1) of my matrix in-place 
(copy=False), I frequently get rows that do not sum to 1. Since these rows are 
probability distributions that I pass to np.random.choice, they must sum to 
exactly 1.0.
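
For example, choice rejects a vector whose sum is too far off (toy example):

import numpy as np
p = np.array([0.3, 0.3, 0.3], dtype=np.float32)  # sums to 0.9, not 1.0
np.random.choice(3, p=p)  # ValueError: probabilities do not sum to 1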



pp.normalize(term, norm='l1', axis=1, copy=False)
sums = term.sum(axis=1)
sums[np.where(sums != 1)]



array([ 0.99999994,  0.99999994,  1.00000012, ...,  0.99999994,

      0.99999994,  0.99999994], dtype=float32)



I wrote some code to manually add or subtract the small difference from 1 in 
each row, and that makes some progress, but still not all of the rows sum to 1.
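
Roughly, the correction I mean looks like this (simplified; it pushes each 
row's residual into that row's largest entry):

# 'term' is the float32 matrix from above.
residual = 1.0 - term.sum(axis=1)
rows = np.arange(term.shape[0])
term[rows, term.argmax(axis=1)] += residual
# Even after this, float32 rounding leaves some sums != 1.0.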



Is there a way to avoid this problem?



— Ryan

------------------------------------------------------------------------------

_______________________________________________

Scikit-learn-general mailing list

Scikit-learn-general@lists.sourceforge.net

https://lists.sourceforge.net/lists/listinfo/scikit-learn-general