On 08/15/2013 01:08 PM, Jason Williams wrote:
> I follow the sample at 
> http://blog.yhathq.com/posts/random-forests-in-python.html where it randomly 
> assigns true, false to the dataset
>
>      np.random.uniform(0, 1, len(df)) <= .75
>
> then partition dataset into train set and test set. I use the same way for 
> creating model
>
In principle that sounds good.
For cross-validation you could also just use the cross_val_score 
function from the cross-validation module.
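
A minimal sketch of what that could look like. The synthetic data from
make_classification and the forest settings are placeholders, not your
setup, and in current scikit-learn releases the function is imported from
sklearn.model_selection:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data; substitute your own feature matrix X and labels y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: one accuracy score per held-out fold.
scores = cross_val_score(clf, X, y, cv=5)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Looking at the spread across folds is usually more informative than a
single random 75/25 split.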

There are three interpretations that come to my mind:
1) Your dataset is just very easy, and the classifier learns it perfectly.
2) There is some trivial unwanted solution, for example you included the 
output label as a feature in the inputs.
3) There are strong correlations between the data points. If your 
dataset contains many near-duplicates,
then for each sample in the test set there is a very similar sample in 
the training set, and it got learned by heart.
In this case you shouldn't assign samples to training and test set randomly, 
but respect the correlation structure of the data.
(One example would be having video frames from several sequences. Then 
you shouldn't assign individual frames
randomly to training and test, but assign each whole sequence to one or the other).
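
For the third case, here is a rough sketch of a split that keeps whole
sequences together, using GroupShuffleSplit from sklearn.model_selection
in current scikit-learn releases. The data, the number of sequences and
the `groups` array are made up for illustration; in practice `groups`
would hold whatever identifies the sequence each sample came from:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.RandomState(0)

# Hypothetical setup: 10 video sequences with 20 frames each.
n_sequences, frames_per_sequence = 10, 20
groups = np.repeat(np.arange(n_sequences), frames_per_sequence)

# Placeholder features and labels, one row per frame.
X = rng.normal(size=(len(groups), 5))
y = rng.randint(0, 2, size=len(groups))

# Hold out about 25% of the sequences (not 25% of the frames).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No sequence should contribute frames to both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
print("sequences in both train and test:", shared)
```

If the score drops a lot with a split like this compared to the random
one, near-duplicates are a likely explanation.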

Hth,
Andy

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
