[Scikit-learn-general] Classifier that is perfectly stable given shuffled training data

Juan Nunez-Iglesias Mon, 02 Feb 2015 01:47:51 -0800

Hi all,


TL;DR version:
I'm looking for a classifier that produces the *exact same model* for shuffled 
versions of the training data. I thought GaussianNB would do the trick, but 
either I don't understand it or some kind of numerical instability prevents it 
from fitting the same model when the data are reshuffled: theta_ agrees to 
about 1e-18 (absolute), but sigma_ only to about 1e-5. Thoughts?
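
To illustrate the kind of effect I suspect, here is a minimal sketch in plain Python. It assumes the discrepancy comes from floating-point summation order; I have not confirmed that this is what happens inside GaussianNB. The helper names are made up for this example. The first function accumulates in input order, so its result can depend on how the data are shuffled. The second uses math.fsum, which returns a correctly rounded sum on typical IEEE 754 platforms and therefore should not depend on order.

```python
import math


def naive_mean_var(values):
    """Mean and (population) variance using ordinary left-to-right sums.

    The rounding error of sum() depends on the order of the inputs, so
    shuffling `values` may change the last few bits of the result.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def exact_mean_var(values):
    """Mean and (population) variance using math.fsum.

    math.fsum tracks exact partial sums and rounds once at the end, so
    the result is the same for any ordering of the same values.
    """
    n = len(values)
    mean = math.fsum(values) / n
    # Each squared deviation depends only on its own value and the
    # (order-independent) mean, so the summed multiset is unchanged.
    var = math.fsum((v - mean) ** 2 for v in values) / n
    return mean, var
```

Calling both functions on a list and on a shuffled copy of it shows whether the naive version drifts on a given data set; the fsum version should return identical floats each time.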


Longer version with cute lesson learned:
I hit another snag with testing for the Py2-3 transition on my 
sklearn-dependent library. This was a fun one to debug. Essentially, I was 
getting some training data, learning a random forest, and then checking the 
predict_proba() outcome on a test set. This was failing, so I assumed that 
somehow the seeding wasn't giving the same outcome in Py2 and 3. I checked up 
and down and sure enough, random seeding was working fine.


The change that *did* happen arose because I was learning edges from a 
networkx graph. Fun fact: networkx.Graph.edges() is actually an iterator over 
dictionary keys, whose ordering is thus not guaranteed, though it is perfectly 
reproducible across most implementations of Py2.7. So, although my tests had 
been happily chugging along for a long time, this ordering changed in Py3.4, 
thus changing the order of the training data and the outcome of 
RandomForestClassifier().fit().
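
One workaround for this ordering problem is to put the training samples into a canonical order before fitting, so that the iteration order of the graph no longer matters. The sketch below is only an illustration under assumptions: canonical_order is a hypothetical helper, and it assumes each sample is a sequence of comparable feature values with a comparable label.

```python
def canonical_order(X, y):
    """Return copies of X and y sorted into a deterministic order.

    Samples are sorted lexicographically by their feature values, with
    the label as a tie-breaker, so any permutation of the same
    (sample, label) pairs yields the same output.
    """
    pairs = sorted(zip(X, y), key=lambda pair: (tuple(pair[0]), pair[1]))
    X_sorted = [list(features) for features, _ in pairs]
    y_sorted = [label for _, label in pairs]
    return X_sorted, y_sorted
```

The sorted X and y could then be passed to fit() with a fixed random seed. Sorting the edges themselves before building the feature matrix would be another option, assuming the node labels are orderable.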


I tried using GaussianNB() as the classifier but that still doesn't have 
reproducible behaviour between Python versions. Any other suggestions?


Thanks!


Juan.
