Hi Tim,

The derivation of sigma(Rw-free) is in this paper: Tickle et al., Acta Cryst.
(2000). D56, 442-450.
Note the difference between the sigma of the weighted/generalized/Hamilton
R-free and that of the 'regular' R-free (there is a 2 in there somewhere). From
my own tests (10-fold cross-validation on 38 small datasets) I also find
sigma(R-free) = R-free/sqrt(Ntest).
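
To make the formula concrete, here is a minimal Python sketch of
sigma(R-free) = R-free/sqrt(Ntest) for the 'regular' R-free. The function
name and the example value R-free = 0.25 are assumptions for illustration
only; the weighted R-free needs the extra factor discussed in the paper.

```python
import math


def sigma_rfree(r_free: float, n_test: int) -> float:
    """Approximate sigma of the conventional R-free: R-free / sqrt(Ntest)."""
    if n_test <= 0:
        raise ValueError("n_test must be a positive number of reflections")
    return r_free / math.sqrt(n_test)


if __name__ == "__main__":
    # Hypothetical R-free of 0.25 with test sets of different sizes.
    for n_test in (500, 1000, 2000):
        print(f"Ntest={n_test:5d}  sigma(R-free)={sigma_rfree(0.25, n_test):.4f}")
```

Quadrupling the test set only halves sigma, so the gain from very large
test sets flattens out quickly.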

For large datasets you really do not need to do k-fold cross-validation,
because sigma(R-free) can be predicted quite well. We just need to realize
that it exists.
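
For anyone who does run k-fold cross-validation, a rough sketch of the
comparison between the observed spread and the predicted sigma could look
like the following. The function name and the R-free values in the example
are made-up placeholders, not results from my tests, and the sample standard
deviation over folds is only one possible estimate of the spread.

```python
import math
import statistics


def compare_kfold_to_prediction(r_free_values, n_test):
    """Return (mean R-free, observed std dev over folds, predicted sigma).

    r_free_values: one R-free per fold (each fold refined separately).
    n_test: number of test-set reflections in each fold.
    """
    mean_r = statistics.mean(r_free_values)
    observed = statistics.stdev(r_free_values)  # sample standard deviation
    predicted = mean_r / math.sqrt(n_test)      # R-free / sqrt(Ntest)
    return mean_r, observed, predicted


if __name__ == "__main__":
    # Placeholder values for a hypothetical 5-fold run, 1000 reflections each.
    folds = [0.248, 0.255, 0.251, 0.243, 0.259]
    mean_r, observed, predicted = compare_kfold_to_prediction(folds, 1000)
    print(f"mean={mean_r:.4f} observed={observed:.4f} predicted={predicted:.4f}")
```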

Cheers,
Robbie
 
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of
> Tim Gruene
> Sent: Tuesday, March 26, 2013 11:05
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] Rfree reflections
> 
> Hi Robbie,
> 
> thank you for the explanation. Heinz Gut and Michael Hadders pointed me at
> Axel Brunger's publication Methods Enzymol. 1997;277:366-96,
> http://www.ncbi.nlm.nih.gov/pubmed/18488318, which is where I got the
> notion of 500-1000 from. In this article a decrease of the error margin of
> Rfree with n^(1/2) is mentioned (p. 384), but only as an observation. Is
> your statement "inversely proportional to the number of reflections" based
> on some statistical treatment, or also just on observation?
> 
> It is a pity that k-fold cross-validation is not standard routine, because
> it seems so easy and so quick to do with today's computers and a simple
> script. But that's probably like reminding people not to use R_int anymore
> in favour of R_meas...
> 
> Cheers,
> Tim
> 
> On Tue, Mar 26, 2013 at 10:24:51AM +0100, Robbie Joosten wrote:
> > Hi Tim,
> >
> > I don't think the 5-10% or 500-1000 reflections are real rules, but
> > rather practical choices. The error margin in R-free is inversely
> > proportional to the number of reflections in your test set and also
> > proportional to R-free itself. So for R-free to be 'significant' you
> > need some absolute number of reflections to reach your cut-off of
> > significance. This is where the 1000 comes from (500 is really pushing
> > the limit).
> > You want to make sure the error margins in R and R-free are not too far
> > apart and you probably also want to keep the test set representative
> > of the whole data set (this is particularly important because we use
> > hold-out validation: you only get one shot at validating). This is where
> > the 5%-10% comes from.
> > Another consideration for going for the 5%-10% thing is that this
> > makes it feasible to do 'full' (i.e. k-fold) cross-validation: you
> > only have to do 20 to 10 refinements. If you went for 1000 reflections
> > you would have to do 48 refinements for the average dataset.
> >
> > Personally, I take 5% and increase this percentage to a maximum of 10%
> > if using 5% gives me a test set smaller than 1000 reflections.
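
The rule of thumb quoted above can be sketched in Python roughly as follows.
The function names are hypothetical, and raising the fraction only just far
enough to reach 1000 reflections is an assumption: the text only says the
percentage goes up to a maximum of 10%.

```python
def choose_test_fraction(n_unique, target=1000, low=0.05, high=0.10):
    """Pick the test-set fraction: 5% by default, raised up to 10% if needed.

    Assumption: the fraction is raised only as far as needed to reach
    `target` reflections, and never above `high`.
    """
    if n_unique <= 0:
        raise ValueError("n_unique must be a positive number of reflections")
    if n_unique * low >= target:
        return low
    return min(high, target / n_unique)


def folds_for_fraction(fraction):
    """Number of refinements for full k-fold cross-validation (5% -> 20)."""
    return round(1 / fraction)


if __name__ == "__main__":
    for n_unique in (8000, 15000, 48000):
        frac = choose_test_fraction(n_unique)
        print(n_unique, f"{frac:.3f}", round(n_unique * frac), folds_for_fraction(frac))
```

With 48000 unique reflections this stays at 5%, which gives 20 refinements
rather than the 48 needed for test sets of 1000 reflections each.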
> >
> > HTH,
> > Robbie
> >
> > > -----Original Message-----
> > > From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf
> > > Of Tim Gruene
> > > Sent: Tuesday, March 26, 2013 09:33
> > > To: CCP4BB@JISCMAIL.AC.UK
> > > Subject: [ccp4bb] Rfree reflections
> > >
> > > Dear all,
> > >
> > > > I recall that the set of Rfree reflections should be 500-1000,
> > > > rather than 5-10%, but I cannot find the reference for it (maybe
> > > > Ian Tickle?).
> > >
> > > I would therefore like to be confirmed or corrected:
> > >
> > > > Is there an absolute number required for Rfree to be significant,
> > > > i.e. 500-1000 irrespective of the total number of unique reflections
> > > > in the data set, or is it 5-10% (as a compromise)?
> > >
> > > Thanks and regards,
> > > Tim
> > >
> > > --
> > > --
> > > Dr Tim Gruene
> > > Institut fuer anorganische Chemie
> > > Tammannstr. 4
> > > D-37077 Goettingen
> > >
> > > GPG Key ID = A46BEE1A
> >
> 
> --
> --
> Dr Tim Gruene
> Institut fuer anorganische Chemie
> Tammannstr. 4
> D-37077 Goettingen
> 
> GPG Key ID = A46BEE1A
