Dear Dusan,

Following up on Gerard's comment, we also read your nice paper with great interest. Your method appears most useful for cases with a limited number of reflections (e.g., a small unit cell and/or low resolution), resulting in 5% test sets with fewer than 1000 reflections in total. It improves the performance of your implementation of ML refinement for the cases that you described. However, we do not think you can conclude that cross-validation is no longer needed. To quote your paper, in the Discussion section:
"To address the use of Rfree as an indicator of wrong structures, we repeated the Kleywegt and Jones experiment (Kleywegt & Jones, 1995; Kleywegt & Jones, 1997) and built the 2ahn structure in the reverse direction and then refined it in the absence of solvent using the ML CV and ML FK approaches. Fig. 9 shows that Rfree stayed around 50% and Rfree-Rwork around 15% in the case of the reverse structure regardless of the ML approach and the fraction of data used in the test set. These values indicate that there is a fundamental problem with the structure, which supports the further use of Rfree as an indicator."

Thank you for reaffirming the utility of the statistical tool of cross-validation. The reverse chain trace of 2ahn is admittedly an extreme case of misfitting, and would probably be detected with other validation tools as well these days. However, the danger of overfitting or misfitting remains very real for large structures, especially when only moderate- to low-resolution data are available, even with today's tools.

Cross-validation can help even at very low resolution: in Structure 20, 957-966 (2012) we showed that cross-validation is useful for certain low-resolution refinements where additional restraints (DEN restraints in that case) are used to reduce overfitting and obtain a more accurate structure. Cross-validation made it possible to detect overfitting of the data when no DEN restraints were used. We believe this should also apply when other types of restraints are used (e.g., reference-model restraints in phenix.refine, REFMAC, or BUSTER).

In summary, we believe that cross-validation remains an important (and conceptually simple) method for detecting overfitting and for overall structure validation.

Axel

Axel T. Brunger
Professor and Chair, Department of Molecular and Cellular Physiology
Investigator, HHMI
Email: [email protected]
Phone: 650-736-1031
Web: http://atbweb.stanford.edu

Paul

Paul Adams
Deputy Division Director, Physical Biosciences Division, Lawrence Berkeley Lab
Division Deputy for Biosciences, Advanced Light Source, Lawrence Berkeley Lab
Adjunct Professor, Department of Bioengineering, U.C. Berkeley
Vice President for Technology, the Joint BioEnergy Institute
Laboratory Research Manager, ENIGMA Science Focus Area
Tel: 1-510-486-4225, Fax: 1-510-486-5909
http://cci.lbl.gov/paul

> On Jun 5, 2015, at 2:18 AM, Gerard Bricogne <[email protected]> wrote:
>
> Dear Dusan,
>
> This is a nice paper and an interestingly different approach to
> avoiding bias and/or quantifying errors - and indeed there are all
> kinds of possibilities if you have a particular structure on which you
> are prepared to spend unlimited time and resources.
>
> The specific context in which Graeme's initial question led me to
> query instead "who should set the FreeR flags, at what stage and on
> what basis?" was that of the data analysis linked to high-throughput
> fragment screening, in which speed is of the essence at every step.
>
> Creating FreeR flags afresh for each target-fragment complex
> dataset without any reference to those used in the refinement of the
> apo structure is by no means an irrecoverable error, but it will take
> extra computing time to let the refinement of the complex adjust to a
> new free set, starting from a model refined with the ignored one. It
> is in order to avoid the need for that extra time, or for a recourse
> to various debiasing methods, that the "book-keeping faff" described
> yesterday has been introduced. Operating without it is perfectly
> feasible; it is just likely not to be optimally direct.
>
> I will probably bow out here, before someone asks "How many
> [e-mails from me] is too many?" :-)
>
> With best wishes,
>
> Gerard.
>
> --
> On Fri, Jun 05, 2015 at 09:14:18AM +0200, dusan turk wrote:
>> Graeme,
>> one more suggestion. You can avoid all the recipes by using all data for
>> the WORK set and 0 reflections for the TEST set, regardless of the amount
>> of data, by using the FREE KICK ML target. For an explanation, see our
>> recent paper: Praznikar, J. & Turk, D. (2014). Free kick instead of
>> cross-validation in maximum-likelihood refinement of macromolecular
>> crystal structures. Acta Cryst. D70, 3124-3134.
>>
>> A link to the paper can be found at http://www-bmb.ijs.si/doc/references.HTML
>>
>> best,
>> dusan
>>
>>> On Jun 5, 2015, at 1:03 AM, CCP4BB automatic digest system
>>> <[email protected]> wrote:
>>>
>>> Date: Thu, 4 Jun 2015 08:30:57 +0000
>>> From: Graeme Winter <[email protected]>
>>> Subject: Re: How many is too many free reflections?
>>>
>>> Hi Folks,
>>>
>>> Many thanks for all of your comments - in keeping with the spirit of the
>>> BB I have digested the responses below. Interestingly, I suspect that the
>>> responses to this question indicate the very wide range of resolution
>>> limits of the data people work with!
>>>
>>> Best wishes,
>>> Graeme
>>>
>>> ===================================
>>>
>>> Proposal 1:
>>>
>>> 10% of reflections, max 2000
>>>
>>> Proposal 2: from the wiki:
>>>
>>> http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set
>>>
>>> including Randy Read's "recipe":
>>>
>>> So here's the recipe I would use, for what it's worth:
>>> <10000 reflections: set aside 10%
>>> 10000-20000 reflections: set aside 1000 reflections
>>> 20000-40000 reflections: set aside 5%
>>> >40000 reflections: set aside 2000 reflections
>>>
>>> Proposal 3:
>>>
>>> 5%, maximum 2-5k
>>>
>>> Proposal 4:
>>>
>>> 3%, minimum 1000
>>>
>>> Proposal 5:
>>>
>>> 5-10% of reflections, minimum 1000
>>>
>>> Proposal 6:
>>>
>>> >50 reflections per "bin" in order to get reliable ML parameter
>>> estimation, ideally around 150 per bin.
>>>
>>> Proposal 7:
>>>
>>> If there are lots of reflections (e.g. 800K unique), around 1% selected -
>>> 5% would be 40k, i.e. rather a lot. Referees question the use of >5k
>>> reflections as a test set.
>>>
>>> Comment 1 in response to this:
>>>
>>> Surely the absolute number of test reflections is not relevant; the
>>> percentage is.
>>>
>>> ============================
>>>
>>> Approximate consensus (i.e. what I will look at doing in xia2): probably
>>> follow the Randy Read recipe from the ccp4wiki, as this seems to
>>> (probably) satisfy most of the criteria raised by everyone else.
>>>
>>> On Tue, Jun 2, 2015 at 11:26 AM Graeme Winter <[email protected]>
>>> wrote:
>>>
>>>> Hi Folks
>>>>
>>>> Had a vague comment handed my way that "xia2 assigns too many free
>>>> reflections" - I have a feeling that by default it makes a free set of
>>>> 5%, which was OK back in the day (like I/sig(I) = 2 was OK) but maybe
>>>> seems excessive now.
>>>>
>>>> This was particularly in the case of high-resolution data where you have
>>>> a lot of reflections, so 5% could be several thousand, which would be
>>>> more than you need just to check that Rfree seems OK.
>>>>
>>>> Since I really don't know what the right number of reflections to assign
>>>> to a free set is, I thought I would ask here - what do you think?
>>>> Essentially I need to assign a minimum percentage or a minimum number -
>>>> the lower of the two, presumably?
>>>>
>>>> Any comments welcome!
>>>>
>>>> Thanks & best wishes,
>>>> Graeme
>>>
>>
>> Dr. Dusan Turk, Prof.
>> Head of Structural Biology Group http://bio.ijs.si/sbl/
>> Head of Centre for Protein and Structure Production
>> Centre of excellence for Integrated Approaches in Chemistry and Biology of
>> Proteins, Scientific Director
>> http://www.cipkebip.org/
>> Professor of Structural Biology at IPS "Jozef Stefan"
>> e-mail: [email protected]
>> phone: +386 1 477 3857
>> fax: +386 1 477 3984
>> Dept. of Biochem. & Mol. & Struct. Biol.
>> Jozef Stefan Institute
>> Jamova 39, 1000 Ljubljana, Slovenia
>> Skype: dusan.turk (voice over internet: www.skype.com)
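[Editor's note: the piecewise "recipe" quoted in the digest above (set aside 10% below 10000 reflections, 1000 reflections between 10000 and 20000, 5% between 20000 and 40000, and 2000 above 40000) can be written out as a small function. This is a minimal sketch for illustration only; the function name and the handling of the exact boundary values are our assumptions, not part of the original recipe.]

```python
def free_set_size(n_reflections):
    """Suggested size of the free (test) set for a given number of
    unique reflections, following the piecewise recipe quoted above.
    Boundary values (10000, 20000, 40000) are assigned to the lower
    branch here; the original text does not specify this.
    """
    if n_reflections < 10000:
        return round(0.10 * n_reflections)   # small data sets: 10%
    elif n_reflections <= 20000:
        return 1000                          # fixed 1000 reflections
    elif n_reflections <= 40000:
        return round(0.05 * n_reflections)   # larger sets: 5%
    else:
        return 2000                          # very large sets: cap at 2000

# Examples:
# free_set_size(8000)   -> 800   (10%)
# free_set_size(15000)  -> 1000  (fixed)
# free_set_size(30000)  -> 1500  (5%)
# free_set_size(800000) -> 2000  (cap; avoids the 40k free set from
#                                 Proposal 7's 5%-of-800K scenario)
```

Note how the cap addresses Graeme's concern: for a high-resolution data set with 800K unique reflections, a flat 5% would set aside 40000 reflections, whereas the recipe never exceeds 2000.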
