Dear Dusan,

Following up on Gerard's comment, we also read your nice paper with great interest. Your method appears most useful for cases with a limited number of reflections (e.g., a small unit cell and/or low resolution), resulting in 5% test sets with fewer than 1000 reflections in total. It improves the performance of your implementation of ML refinement for the cases that you described. However, we do not think you can conclude that cross-validation is no longer needed. To quote your paper, in the Discussion section:
"To address the use of Rfree as an indicator of wrong structures, we repeated the Kleywegt and Jones experiment (Kleywegt & Jones, 1995; Kleywegt & Jones, 1997) and built the 2ahn structure in the reverse direction and then refined it in the absence of solvent using the ML CV and ML FK approaches. Fig. 9 shows that Rfree stayed around 50% and Rfree-Rwork around 15% in the case of the reverse structure regardless of the ML approach and the fraction of data used in the test set. These values indicate that there is a fundamental problem with the structure, which supports the further use of Rfree as an indicator."

Thank you for reaffirming the utility of the statistical tool of cross-validation. The reverse chain trace of 2ahn is admittedly an extreme case of misfitting, and would probably be detected with other validation tools as well these days. However, the danger of overfitting or misfitting remains very real for large structures, especially when only moderate- to low-resolution data are available, even with today's tools.

Cross-validation can help even at very low resolution: in Structure 20, 957-966 (2012) we showed that cross-validation is useful for certain low-resolution refinements where additional restraints (DEN restraints in that case) are used to reduce overfitting and obtain a more accurate structure. Cross-validation made it possible to detect overfitting of the data when no DEN restraints were used. We believe this should also apply when other types of restraints are used (e.g., reference-model restraints in phenix.refine, REFMAC, or BUSTER).

In summary, we believe that cross-validation remains an important (and conceptually simple) method for detecting overfitting and for overall structure validation.

Axel

Axel T. Brunger
Professor and Chair, Department of Molecular and Cellular Physiology
Investigator, HHMI
Email: [email protected]
Phone: 650-736-1031
Web: http://atbweb.stanford.edu

Paul

Paul Adams
Deputy Division Director, Physical Biosciences Division, Lawrence Berkeley Lab
Division Deputy for Biosciences, Advanced Light Source, Lawrence Berkeley Lab
Adjunct Professor, Department of Bioengineering, U.C. Berkeley
Vice President for Technology, the Joint BioEnergy Institute
Laboratory Research Manager, ENIGMA Science Focus Area
Tel: 1-510-486-4225, Fax: 1-510-486-5909
http://cci.lbl.gov/paul

> On Jun 5, 2015, at 2:18 AM, Gerard Bricogne <[email protected]> wrote:
>
> Dear Dusan,
>
> This is a nice paper and an interestingly different approach to
> avoiding bias and/or quantifying errors - and indeed there are all
> kinds of possibilities if you have a particular structure on which you
> are prepared to spend unlimited time and resources.
>
> The specific context in which Graeme's initial question led me to
> query instead "who should set the FreeR flags, at what stage and on
> what basis?" was that of the data analysis linked to high-throughput
> fragment screening, in which speed is of the essence at every step.
>
> Creating FreeR flags afresh for each target-fragment complex
> dataset without any reference to those used in the refinement of the
> apo structure is by no means an irrecoverable error, but it will take
> extra computing time to let the refinement of the complex adjust to a
> new free set, starting from a model refined with the ignored one. It
> is in order to avoid the need for that extra time, or for a recourse
> to various debiasing methods, that the "book-keeping faff" described
> yesterday has been introduced. Operating without it is perfectly
> feasible; it is just likely not to be optimally direct.
>
> I will probably bow out here, before someone asks "How many
> [e-mails from me] is too many?" :-)
>
> With best wishes,
>
> Gerard.
>
> --
> On Fri, Jun 05, 2015 at 09:14:18AM +0200, dusan turk wrote:
>> Graeme,
>> one more suggestion. You can avoid all the recipes by using all data for
>> the WORK set and 0 reflections for the TEST set, regardless of the amount
>> of data, by using the FREE KICK ML target. For an explanation, see our
>> recent paper: Praznikar, J. & Turk, D. (2014). Free kick instead of
>> cross-validation in maximum-likelihood refinement of macromolecular
>> crystal structures. Acta Cryst. D70, 3124-3134.
>>
>> A link to the paper can be found at http://www-bmb.ijs.si/doc/references.HTML
>>
>> best,
>> dusan
>>
>>> On Jun 5, 2015, at 1:03 AM, CCP4BB automatic digest system
>>> <[email protected]> wrote:
>>>
>>> Date: Thu, 4 Jun 2015 08:30:57 +0000
>>> From: Graeme Winter <[email protected]>
>>> Subject: Re: How many is too many free reflections?
>>>
>>> Hi Folks,
>>>
>>> Many thanks for all of your comments - in keeping with the spirit of the
>>> BB I have digested the responses below. Interestingly, I suspect that the
>>> responses to this question indicate the very wide range of resolution
>>> limits of the data people work with!
>>>
>>> Best wishes,
>>> Graeme
>>>
>>> ===================================
>>>
>>> Proposal 1:
>>>
>>> 10% of reflections, max 2000
>>>
>>> Proposal 2: from the wiki:
>>>
>>> http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set
>>>
>>> including Randy Read's "recipe":
>>>
>>> So here's the recipe I would use, for what it's worth:
>>> <10000 reflections: set aside 10%
>>> 10000-20000 reflections: set aside 1000 reflections
>>> 20000-40000 reflections: set aside 5%
>>> >40000 reflections: set aside 2000 reflections
>>>
>>> Proposal 3:
>>>
>>> 5%, maximum 2-5k
>>>
>>> Proposal 4:
>>>
>>> 3%, minimum 1000
>>>
>>> Proposal 5:
>>>
>>> 5-10% of reflections, minimum 1000
>>>
>>> Proposal 6:
>>>
>>> >50 reflections per "bin" in order to get reliable ML parameter
>>> estimation, ideally around 150 per bin.
>>>
>>> Proposal 7:
>>>
>>> If there are lots of reflections (e.g. 800K unique), around 1% selected -
>>> 5% would be 40k, i.e. rather a lot. Referees question the use of >5k
>>> reflections as a test set.
>>>
>>> Comment 1 in response to this:
>>>
>>> Surely the absolute number of test reflections is not relevant; the
>>> percentage is.
>>>
>>> ============================
>>>
>>> Approximate consensus (i.e. what I will look at doing in xia2): probably
>>> follow the Randy Read recipe from the ccp4wiki, as this seems to
>>> (probably) satisfy most of the criteria raised by everyone else.
>>>
>>> On Tue, Jun 2, 2015 at 11:26 AM Graeme Winter <[email protected]>
>>> wrote:
>>>
>>>> Hi Folks
>>>>
>>>> Had a vague comment handed my way that "xia2 assigns too many free
>>>> reflections" - I have a feeling that by default it makes a free set of
>>>> 5%, which was OK back in the day (like I/sig(I) = 2 was OK) but maybe
>>>> seems excessive now.
>>>>
>>>> This was particularly in the case of high-resolution data where you have
>>>> a lot of reflections, so 5% could be several thousand, which would be
>>>> more than you need just to check that Rfree seems OK.
>>>>
>>>> Since I really don't know what the right number of reflections to assign
>>>> to a free set is, I thought I would ask here - what do you think?
>>>> Essentially I need to assign a minimum percentage or a minimum number -
>>>> the lower of the two, presumably?
>>>>
>>>> Any comments welcome!
>>>>
>>>> Thanks & best wishes,
>>>> Graeme
>>>
>>
>> Dr. Dusan Turk, Prof.
>> Head of Structural Biology Group http://bio.ijs.si/sbl/
>> Head of Centre for Protein and Structure Production
>> Centre of excellence for Integrated Approaches in Chemistry and Biology of
>> Proteins, Scientific Director
>> http://www.cipkebip.org/
>> Professor of Structural Biology at IPS "Jozef Stefan"
>> e-mail: [email protected]
>> phone: +386 1 477 3857
>> fax: +386 1 477 3984
>> Dept. of Biochem. & Mol. & Struct. Biol.
>> Jozef Stefan Institute
>> Jamova 39, 1000 Ljubljana, Slovenia
>> Skype: dusan.turk (voice over internet: www.skype.com)
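[Editor's note: the piecewise "recipe" quoted in the digest above (set aside 10% below 10000 reflections, 1000 reflections between 10000 and 20000, 5% between 20000 and 40000, and 2000 above 40000) can be written out as a small function. This is a minimal sketch for illustration only; the function name and the handling of the exact boundary values are our assumptions, not part of the original recipe.]

```python
def free_set_size(n_reflections):
    """Suggested size of the free (test) set for a given number of
    unique reflections, following the piecewise recipe quoted above.
    Boundary values (10000, 20000, 40000) are assigned to the lower
    branch here; the original text does not specify this.
    """
    if n_reflections < 10000:
        return round(0.10 * n_reflections)   # small data sets: 10%
    elif n_reflections <= 20000:
        return 1000                          # fixed 1000 reflections
    elif n_reflections <= 40000:
        return round(0.05 * n_reflections)   # larger sets: 5%
    else:
        return 2000                          # very large sets: cap at 2000

# Examples:
# free_set_size(8000)   -> 800   (10%)
# free_set_size(15000)  -> 1000  (fixed)
# free_set_size(30000)  -> 1500  (5%)
# free_set_size(800000) -> 2000  (cap; avoids the 40k free set from
#                                 Proposal 7's 5%-of-800K scenario)
```

Note how the cap addresses Graeme's concern: for a high-resolution data set with 800K unique reflections, a flat 5% would set aside 40000 reflections, whereas the recipe never exceeds 2000.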
