On 08/15/2013 01:08 PM, Jason Williams wrote:
> I follow the sample at
> http://blog.yhathq.com/posts/random-forests-in-python.html where it randomly
> assigns true, false to the dataset
>
> np.random.uniform(0, 1, len(df)) <= .75
>
> then partition dataset into train set and test set. I use the same way for
> creating model
>

In principle that sounds good. For cross-validation you could also just use the cross_val_score function from the cross-validation module.
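To make the two evaluation options concrete, here is a minimal sketch that places the random mask split from the quoted post next to cross_val_score. The synthetic data from make_classification, the random forest settings, and the choice of 5 folds are all assumptions for illustration; the original dataset and model are not shown in this thread. The import path is the one used by recent scikit-learn releases (sklearn.model_selection); older releases exposed the function from sklearn.cross_validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; the poster's dataset is not shown).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Random split as in the quoted blog post: roughly 75% train, 25% test.
rng = np.random.RandomState(0)
is_train = rng.uniform(0, 1, len(X)) <= 0.75

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X[is_train], y[is_train])
holdout_accuracy = clf.score(X[~is_train], y[~is_train])

# Alternative: 5-fold cross-validation, which returns one score per fold.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=5
)

print("hold-out accuracy:", holdout_accuracy)
print("cross-validation accuracy per fold:", cv_scores)
```

Cross-validation uses every sample for testing exactly once, so the spread of the per-fold scores gives a rough idea of how much a single random split can vary.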
There are three interpretations that come to my mind:

1) Your dataset is just very easy, and the classifier learns it perfectly.
2) There is some trivial unwanted solution, for example you included the output label as a feature in the inputs.
3) There are strong correlations between the data points. If your dataset contains many near-duplicates, then for each sample in the test set there is a very similar sample in the training set, and it got learned by heart. In this case you shouldn't assign samples to the training and test sets randomly, but respect the correlation structure of the data. (One example would be video frames from several sequences. Then you shouldn't assign individual frames randomly to training and test, but whole sequences.)

Hth,
Andy

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
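For the third interpretation, a split that respects the correlation structure could look roughly like the sketch below. It assumes each sample carries a group label (for video data, the id of the sequence a frame came from) and assigns whole groups to either side. The helper name split_by_group, its parameters, and the toy sequence ids are hypothetical and written only for illustration. Recent scikit-learn releases also ship group-aware splitters such as GroupKFold and GroupShuffleSplit in sklearn.model_selection, which may be the more convenient option.

```python
import numpy as np


def split_by_group(groups, test_fraction=0.25, seed=0):
    """Return boolean (train, test) masks that never split a group.

    Hypothetical helper: whole groups are assigned to the test set
    until roughly `test_fraction` of the groups are held out.
    """
    groups = np.asarray(groups)
    rng = np.random.RandomState(seed)

    # Shuffle the distinct group ids, not the individual samples.
    unique_groups = np.unique(groups)
    rng.shuffle(unique_groups)

    # Hold out at least one whole group.
    n_test = max(1, int(round(test_fraction * len(unique_groups))))
    test_groups = unique_groups[:n_test]

    is_test = np.isin(groups, test_groups)
    return ~is_test, is_test


# Toy example (assumed data): 4 video sequences with 3 frames each.
sequence_id = np.repeat([0, 1, 2, 3], 3)
is_train, is_test = split_by_group(sequence_id)

train_sequences = set(sequence_id[is_train].tolist())
test_sequences = set(sequence_id[is_test].tolist())
print("train sequences:", sorted(train_sequences))
print("test sequences:", sorted(test_sequences))
```

If the score drops noticeably under a split like this compared to the per-sample random split, that would point towards near-duplicates across train and test as the explanation for the perfect result.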