2012/7/5 Olivier Grisel <[email protected]>:
> 2012/7/5 Emanuele Olivetti <[email protected]>:
>> On 07/05/2012 09:45 AM, Andreas Mueller wrote:
>>> Hey Peter.
>>> Pretty awesome feat! Thanks for all the work you put into the ensemble
>>> module!
>>>
>>> A blog post about this competition would really be great :)
>>>
>>> I was wondering, was there much difference in performance between GBRT
>>> and RF?
>>
>> Hi,
>>
>> I participated in the competition (with not so good results ;-)) but
>> there was indeed a significant difference between RF and GB. I tried RF,
>> ExtraTrees and GradientBoosting and got an RMSLE of ~0.60 with the first
>> two (with blending and a preprocessing procedure similar to Peter's) and
>> ~0.578 with GB. You might say the difference is small, but it corresponds
>> to tens of positions in the final ranking ;-)
>
> It is amazing that the intrinsic variance of those estimators is so
> small that one can say that a GBRT with mean loss around 0.578 beats
> an RF with mean loss around 0.60 in a significant manner.

What do you mean by intrinsic variance exactly? Do you mean how
performance varies w.r.t. differences in the training set? If so,
the question is whether they differ consistently.
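
A rough sketch of what I mean by "differ consistently" (modern
scikit-learn API; make_regression stands in for the real preprocessed
competition features, and the hyperparameters are illustrative, not the
ones anyone actually used): score both models on the same CV folds with
an RMSLE metric and compare the per-fold results.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold

def rmsle(y_true, y_pred):
    # Root mean squared logarithmic error; targets assumed non-negative.
    y_pred = np.clip(y_pred, 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Placeholder data in place of the competition features/targets.
X, y = make_regression(n_samples=2000, n_features=50, noise=10.0, random_state=0)
y = np.abs(y)  # RMSLE needs non-negative targets

models = {
    "GBRT": GradientBoostingRegressor(n_estimators=500, max_depth=4,
                                      learning_rate=0.05, random_state=0),
    "RF": RandomForestRegressor(n_estimators=500, random_state=0),
}

# Same folds for both models, so per-fold scores are directly comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: [] for name in models}
for train_idx, test_idx in cv.split(X):
    for name, model in models.items():
        model.fit(X[train_idx], y[train_idx])
        scores[name].append(rmsle(y[test_idx], model.predict(X[test_idx])))

for name, s in scores.items():
    print("%4s  mean RMSLE %.4f  per-fold std %.4f" % (name, np.mean(s), np.std(s)))

If one model wins on (almost) every fold, the gap is consistent even
when the per-fold stds overlap.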

>
> What is the order of magnitude std error of 10 folds cross validation
> on this problem? 0.1? less?
>
> Is the ordering of the validation set and the final test set changed
> much? Did you observe wide differences between you internal CV score
> and the final test scores ?

Given the small differences on the leaderboard, the final rankings were
rather stable - in contrast to the recently finished "Biological
Response" challenge, where second place on the public leaderboard ended
up at rank 29 in the final standings.

Some contestants reported wide differences between internal CV scores
and leaderboard scores; my own internal CV scores tracked the
leaderboard quite well (see [1]).
Model selection is my nemesis - little can be gained, everything lost :-)

In the end I did 5x 5-fold CV - the error std between the repetitions
was around 0.005. I need to rerun my best model to get the error std
within a single repetition.

[1] http://www.kaggle.com/c/online-sales/forums/t/1958/cross-validation-errors
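
For reference, a minimal sketch of that 5x 5-fold scheme (modern
scikit-learn API; `model`, `X`, `y` stand in for my actual pipeline and
the competition data, which are not reproduced here). The std it
returns across the five repetition means is the quantity the ~0.005
figure refers to.

import numpy as np
from sklearn.model_selection import KFold

def repeated_cv_rmsle(model, X, y, n_repeats=5, n_splits=5):
    # Mean RMSLE per repetition of shuffled 5-fold CV, plus the std
    # *between* repetitions.
    rep_means = []
    for seed in range(n_repeats):
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        fold_scores = []
        for train_idx, test_idx in cv.split(X):
            model.fit(X[train_idx], y[train_idx])
            pred = np.clip(model.predict(X[test_idx]), 0, None)
            fold_scores.append(
                np.sqrt(np.mean((np.log1p(pred) - np.log1p(y[test_idx])) ** 2)))
        rep_means.append(np.mean(fold_scores))
    return np.mean(rep_means), np.std(rep_means)

The std of the fold scores *within* one repetition is the other number
I still need to rerun my best model for.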

>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>



-- 
Peter Prettenhofer
