Hi everyone,

Yeah, I'm aware that floating point operations cause reproducibility 
problems, but numerically stable algorithms do better than naive ones at 
limiting the size of the deviations. I don't know how much better, though; 
it's been a while since I took numerical methods. =)
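
For anyone curious, here is a quick, self-contained sketch of the kind of
difference I mean (just an illustration, not what scikit-learn does
internally): the naive E[x^2] - E[x]^2 variance formula loses its leading
digits to cancellation, while Welford's online algorithm stays stable:

import numpy as np

def naive_var(x):
    # E[x^2] - E[x]^2: the two large terms nearly cancel when |mean| >> std
    x = np.asarray(x, dtype=np.float64)
    return (x ** 2).mean() - x.mean() ** 2

def welford_var(x):
    # Welford's online algorithm: one numerically stable pass
    mean = m2 = 0.0
    for n, xi in enumerate(x, start=1):
        delta = xi - mean
        mean += delta / n
        m2 += delta * (xi - mean)
    return m2 / len(x)

x = 1e8 + np.random.RandomState(0).randn(10000)  # large offset, unit variance
print(naive_var(x))    # typically wrong in the leading digits
print(welford_var(x))  # ~1.0
print(x.var())         # numpy's two-pass result, also ~1.0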

Re: data, sure, it's in my repo already! =) With this dataset:

https://github.com/jni/gala/blob/py3k/tests/example-data/train-set-1.npz

I produced this session:

In [1]: import numpy as np

In [2]: from sklearn.naive_bayes import GaussianNB

In [3]: tr = np.load('train-set-1.npz')

In [4]: X, y = tr['X'], tr['y'][:, 0]  # y has >1 label per sample; use only the first

In [5]: X.shape, y.shape

Out[5]: ((1002, 33), (1002,))

In [6]: nb = GaussianNB()

In [7]: nb.fit(X, y)

Out[7]: GaussianNB()

In [8]: shuffle_idxs = list(range(len(y)))

In [9]: np.random.shuffle(shuffle_idxs)  # shuffle the row indices in place

In [10]: Xs, ys = X[shuffle_idxs, :], y[shuffle_idxs]

In [11]: nb2 = GaussianNB()

In [12]: nb2.fit(Xs, ys)

Out[12]: GaussianNB()

In [13]: np.abs(nb.theta_ - nb2.theta_).max()

Out[13]: 9.9920072216264089e-16

In [14]: np.abs(nb.sigma_ - nb2.sigma_).max()

Out[14]: 1.9073486328125e-05
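
Since what I really want is the exact same model regardless of row order, one
workaround I may try is to sort the samples into a canonical order before
fitting, so that any permutation of the input leads to the same sequence of
floating point operations. Just a sketch (canonical_fit is a made-up helper
of mine, not an sklearn API):

import numpy as np

def canonical_fit(clf, X, y):
    # np.lexsort treats its *last* key as the primary one, so this sorts
    # by label first, then by each feature column in turn; any shuffle of
    # (X, y) therefore ends up in one fixed order before fitting.
    order = np.lexsort(np.vstack([X.T[::-1], y]))
    return clf.fit(X[order], y[order])

With that, fitting on the original and on a shuffled copy should give
bitwise-identical theta_ and sigma_, since the summations happen in the same
order.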

Thanks guys!

Juan.

On Tue, Feb 3, 2015 at 1:15 AM, Andy <t3k...@gmail.com> wrote:

> Hi Juan.
> Matching up to floating point precision is pretty hard, as Gael 
> mentioned. 1e-5 on sigma seems like pretty low precision, though.
> Can you post data to reproduce?
> I would expect most classifiers to go to around 1e-8.
> Cheers,
> Andreas
> On 02/02/2015 10:46 AM, Juan Nunez-Iglesias wrote:
>> Hi all,
>>
>> *TL;DR version:*
>> I'm looking for a classifier that will get the *exact same model* for 
>> shuffled versions of the training data. I thought GaussianNB would do 
>> the trick but either I don't understand it, or some kind of numerical 
>> instability prevents it from achieving the same model on subsequent 
>> shuffling of the data — I get about 1e-18 absolute tolerance on theta_ 
>> but only 1e-5 on sigma_. Thoughts?
>>
>> *Longer version with cute lesson learned:*
>> I hit another snag with testing for the Py2-3 transition on my 
>> sklearn-dependent library. This was a fun one to debug. Essentially, I 
>> was getting some training data, learning a random forest, and then 
>> checking the predict_proba() outcome on a test set. This was failing, 
>> so I assumed that somehow the seeding wasn't giving the same outcome 
>> in Py2 and 3. I checked up and down and sure enough, random seeding 
>> was working fine.
>>
>> The random change that *did* happen was because I was learning edges 
>> from a networkx graph. Fun fact: networkx.Graph.edges() is actually an 
>> iterator over dictionary keys, whose ordering is thus not guaranteed, 
>> /though it is perfectly reproducible across most implementations of 
>> Py2.7/. So, although my tests had been happily chugging along for a 
>> long time, this ordering changed in Py3.4, thus changing the order of 
>> the training data and the outcome of RandomForestClassifier().fit().
>>
>> I tried using GaussianNB() as the classifier but that still doesn't 
>> have reproducible behaviour between Python versions. Any other 
>> suggestions?
>>
>> Thanks!
>>
>> Juan.
>>
