Hi Tim,

The derivation of sigma(Rw-free) is in this paper: Tickle et al., Acta Cryst. (2000). D56, 442-450. Note the difference between the sigma of the weighted/generalized/Hamilton R-free and that of the 'regular' R-free (there is a 2 in there somewhere). From my own tests (10-fold cross-validation on 38 small datasets) I also find sigma(R-free) = R-free/sqrt(Ntest).
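
To make the relation concrete, here is a minimal sketch of sigma(R-free) = R-free/sqrt(Ntest) for the 'regular' R-free. The function name and the example R-free of 0.25 are only illustrative assumptions, and the extra factor that applies to the weighted/Hamilton R-free is deliberately left out.

```python
import math


def sigma_rfree(r_free: float, n_test: int) -> float:
    """Estimate the standard uncertainty of the 'regular' R-free.

    Uses sigma(R-free) = R-free / sqrt(Ntest), as stated in this message.
    The function name is hypothetical, chosen for illustration only.
    """
    if n_test <= 0:
        raise ValueError("n_test must be a positive number of reflections")
    return r_free / math.sqrt(n_test)


if __name__ == "__main__":
    # Illustrative R-free of 0.25 with test sets of different sizes.
    for n_test in (500, 1000, 2000):
        print(f"Ntest = {n_test:5d}  sigma(R-free) = {sigma_rfree(0.25, n_test):.4f}")
```

Because the estimate scales with 1/sqrt(Ntest), quadrupling the test set only halves the error margin.
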
For large datasets you really do not need to do k-fold cross-validation, because sigma(R-free) can be predicted quite well. We just need to realize that it exists.

Cheers,
Robbie

> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of
> Tim Gruene
> Sent: Tuesday, March 26, 2013 11:05
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] Rfree reflections
>
> Hi Robbie,
>
> thank you for the explanation. Heinz Gut and Michael Hadders pointed me
> to Axel Brunger's publication Methods Enzymol. 1997;277:366-96,
> http://www.ncbi.nlm.nih.gov/pubmed/18488318, which is where I got the
> notion of 500-1000 from. In this article a decrease of the error margin
> of Rfree with n^(1/2) is mentioned (p. 384), but only as an observation.
> Is your statement "inversely proportional to the number of reflections"
> based on some statistical treatment, or also just on observation?
>
> It is a pity that k-fold cross-validation is not standard routine,
> because it seems so easy and quick to do with today's computers and a
> simple script. But that's probably like reminding people not to use
> R_int anymore in favour of R_meas...
>
> Cheers,
> Tim
>
> On Tue, Mar 26, 2013 at 10:24:51AM +0100, Robbie Joosten wrote:
> > Hi Tim,
> >
> > I don't think the 5-10% or 500-1000 reflections are real rules, but
> > rather practical choices. The error margin in R-free is inversely
> > proportional to the number of reflections in your test set and also
> > proportional to R-free itself. So for R-free to be 'significant' you
> > need some absolute number of reflections to reach your cut-off of
> > significance. This is where the 1000 comes from (500 is really
> > pushing the limit).
> > You want to make sure the error margins in R and R-free are not too
> > far apart and you probably also want to keep the test set
> > representative of the whole data set (this is particularly important
> > because we use hold-out validation, so you only get one shot at
> > validating). This is where the 5%-10% comes from.
> > Another consideration for going for the 5%-10% thing is that this
> > makes it feasible to do 'full' (i.e. k-fold) cross-validation: you
> > only have to do 20-10 refinements. If you went for 1000 reflections
> > you would have to do 48 refinements for the average dataset.
> >
> > Personally, I take 5% and increase this percentage to a maximum of
> > 10% if using 5% gives me a test set smaller than 1000 reflections.
> >
> > HTH,
> > Robbie
> >
> > > -----Original Message-----
> > > From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf
> > > Of Tim Gruene
> > > Sent: Tuesday, March 26, 2013 09:33
> > > To: CCP4BB@JISCMAIL.AC.UK
> > > Subject: [ccp4bb] Rfree reflections
> > >
> > > Dear all,
> > >
> > > I recall that the set of Rfree reflections should be 500-1000,
> > > rather than 5-10%, but I cannot find the reference for it (maybe
> > > Ian Tickle?).
> > >
> > > I would therefore like to be confirmed or corrected:
> > >
> > > Is there an absolute number required for Rfree to be significant,
> > > i.e. 500-1000 irrespective of the total number of unique
> > > reflections in the data set, or is it 5-10% (as a compromise)?
> > >
> > > Thanks and regards,
> > > Tim
> > >
> > > --
> > > Dr Tim Gruene
> > > Institut fuer anorganische Chemie
> > > Tammannstr. 4
> > > D-37077 Goettingen
> > >
> > > GPG Key ID = A46BEE1A
>
> --
> Dr Tim Gruene
> Institut fuer anorganische Chemie
> Tammannstr. 4
> D-37077 Goettingen
>
> GPG Key ID = A46BEE1A
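
The test-set sizing rule from the quoted thread (take 5%, raise it to at most 10% if 5% gives fewer than 1000 reflections) can be sketched roughly as follows. The function names and default parameters are hypothetical, and the rounding choices are assumptions rather than anything stated in the thread.

```python
def choose_test_fraction(n_unique: int,
                         target: int = 1000,
                         min_frac: float = 0.05,
                         max_frac: float = 0.10) -> float:
    """Pick the test-set fraction: 5% by default, raised up to 10%
    when 5% would give fewer than `target` reflections.

    Hypothetical helper; names and defaults are illustrative only.
    """
    if n_unique <= 0:
        raise ValueError("n_unique must be a positive number of reflections")
    needed = target / n_unique  # fraction that would give exactly `target`
    return min(max(min_frac, needed), max_frac)


def n_refinements(fraction: float) -> int:
    """Number of refinements for 'full' k-fold cross-validation
    (assumption: k is 1/fraction rounded to the nearest integer)."""
    return round(1.0 / fraction)


if __name__ == "__main__":
    for n_unique in (5000, 15000, 48000):
        frac = choose_test_fraction(n_unique)
        n_test = round(frac * n_unique)
        print(f"{n_unique:6d} unique: fraction = {frac:.3f}, "
              f"Ntest = {n_test}, refinements = {n_refinements(frac)}")
```

With a fixed test set of 1000 reflections the number of folds is simply n_unique/1000, so the 48 refinements mentioned in the thread would correspond to a dataset of about 48,000 unique reflections.
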