Re: [ccp4bb] AW: [ccp4bb] over-fitting? over-refinement?
I too have seen some horrendous low-resolution models, with correspondingly bad validation statistics. A little time spent cleaning up the outliers (geometric and others) rarely results in large reductions in R(free) for these types of datasets and models, but ultimately we as a community need to emphasize that R(free) is not the be-all and end-all as a quality metric.

Diana

Diana R. Tomchick
Professor
Departments of Biophysics and Biochemistry
UT Southwestern Medical Center
5323 Harry Hines Blvd. Rm. ND10.214A
Dallas, TX 75390-8816
diana.tomch...@utsouthwestern.edu
(214) 645-6383 (phone)
(214) 645-6353 (fax)

In reply to: Tristan Croll, Tuesday, October 20, 2020 7:11 AM
Re: [ccp4bb] AW: [ccp4bb] over-fitting? over-refinement?
I'd like to append a very important caveat to this discussion: most of the talk on Rfree as protection against overfitting is perfectly correct, if your dataset is high enough resolution. Remember that Rfree only provides protection against one form of overfitting: that is, fitting of atoms into random noise. What it doesn't protect well against is fitting the wrong atoms into real density. Remember, all your X-ray data ultimately says is "there are electrons here" - your R-factors don't care where those electrons come from, as long as they're present in about the right numbers (with some fudge-room for B-factors and occupancies).

If you browse through the back catalogue of >3 Å models, you'll find some with horrendous geometry statistics but remarkably good R-factors (both work and free) - ultimately, I think, because the model atoms have been "overstuffed" into density that is real according to both the working and free data. In quite a few such cases I find that even after extensive reworking I'm unable to beat the original R-free, despite every other metric improving markedly.

Best regards,
Tristan

In reply to: Matthias Barone, 20 October 2020 12:59

To unsubscribe from the CCP4BB list, click the following link: https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB=1
Re: [ccp4bb] AW: [ccp4bb] over-fitting? over-refinement?
Eleanor raises a very important practical point here: "sidechains at the solvent interface have multiple conformations, and that as a result the water networks should also have partial occupancies". I was fighting with such a model for half a year and also tested XSHEL (there was a thread in here for that). Coupling partial occupancies of sidechains with waters and other sidechains is a horrendously time-consuming task, and in the end, as Eleanor said, "correcting these details does not change the Rfactors at all". You just get fed up with that puzzle and stop right there.

best, matthias

Dr. Matthias Barone
AG Kuehne, Rational Drug Design
Leibniz-Forschungsinstitut für Molekulare Pharmakologie (FMP)
Robert-Rössle-Strasse 10
13125 Berlin
Germany
Phone: +49 (0)30 94793-284

In reply to: Eleanor Dodson, Tuesday, October 20, 2020 12:40:19 PM

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list hosted by www.jiscmail.ac.uk, terms & conditions are available at https://www.jiscmail.ac.uk/policyandsecurity/
Re: [ccp4bb] AW: [ccp4bb] over-fitting? over-refinement?
It is always hard to know when to stop tweaking a model. We know from high resolution studies that many sidechains at the solvent interface have multiple conformations, and that as a result the water networks should also have partial occupancies. But usually correcting these details does not change the Rfactors at all - nor contribute much to the biological relevance of your structure!

So often the point to stop is when you get fed up. Phil Evans said years ago: "I spend 95% of my time on 5% of the structure, most of which is unimportant."

In practice I let the difference maps decide when to stop: a 10 sigma peak - think why; lots of 5 sigma positive and negative ones - not so important.

Eleanor

In reply to: Herman Schreuder, Tue, 20 Oct 2020 at 11:27
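As a rough illustration of the difference-map rule of thumb in Eleanor's message, the sketch below sorts a list of peaks into those that demand an explanation and those that are less important. The peak labels and heights are invented, the function name `triage_peaks` is hypothetical, and treating 5 sigma as the lower cut-off for the "minor" group is an assumption made only for this example; the 10 sigma figure is the one quoted in the message.

```python
def triage_peaks(peaks, must_explain=10.0, minor=5.0):
    """Split difference-map peaks by absolute height (in sigma units).

    peaks: list of (label, height_in_sigma); negative heights are negative peaks.
    Returns (urgent, minor_peaks), each sorted by decreasing |height|.
    """
    urgent = [p for p in peaks if abs(p[1]) >= must_explain]
    minor_peaks = [p for p in peaks if minor <= abs(p[1]) < must_explain]
    urgent.sort(key=lambda p: abs(p[1]), reverse=True)
    minor_peaks.sort(key=lambda p: abs(p[1]), reverse=True)
    return urgent, minor_peaks


# Hypothetical peak list, for illustration only.
peaks = [("peak 1", 10.8), ("peak 2", 5.2), ("peak 3", -11.3), ("peak 4", -4.1)]
urgent, minor_peaks = triage_peaks(peaks)
print("think why:", urgent)
print("not so important:", minor_peaks)
```

In real work the peak list would come from the peak search of whatever map program is in use; the thresholds are a matter of judgment, not fixed constants.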
[ccp4bb] AW: [ccp4bb] over-fitting? over-refinement?
A practice that was very popular before the Rfree came around was to fit a water molecule in every noise peak. One would get spectacularly low Rfactors this way, but I cannot imagine that anyone would believe that this would be fitting and not over-fitting.

Best,
Herman

In reply to: Sam Tang, Tuesday, 20 October 2020 05:27
Re: [ccp4bb] over-fitting? over-refinement?
There was a comment on the bb from Ian Tickle last year which explains it very well, I think:

"Rfree is not unbiased: as a measure of the agreement it is biased upwards by overfitting (otherwise how could it be used to detect overfitting?), by failing to fit with the uncorrelated errors in the test-set Fobs, just as Rwork is biased downwards by fitting to the errors in the working-set Fobs. Overfitting becomes immediately apparent whenever you perform any refinement, so the only point at which there is no overfitting is for the initial model when Rwork and Rfree are equal, apart from a small difference arising from random sampling of the test-set (that sampling error could be reduced by performing refinements with all 20 working/test sets combinations and averaging the R values). From there on the 'gap' between Rwork and Rfree is a measure of the degree of overfitting, so we should really be taking some average of Rwork and Rfree as the true measure of agreement (though the biases are not exactly equal and opposite so it's not a simple arithmetic mean).

The goal of choosing the appropriate refinement parameters, restraints and weights is to _minimise_ overfitting, not eliminate it. It is not possible to eliminate it completely: if it were then Rwork and Rfree would become equal (apart from that small effect from random sampling)."

That was Ian, not me, of course!

Best wishes,
Jon Cooper.
jon.b.coo...@protonmail.com

In reply to: Sam Tang, 20 Oct 2020, 04:27
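To make the Rwork/Rfree "gap" from Ian's comment concrete, here is a minimal sketch of how the two numbers are computed from the same list of reflections. It assumes the conventional definition R = sum|Fobs - Fcalc| / sum(Fobs) with amplitudes already on a common scale; the function names and the six toy reflections are invented for the example, and real refinement programs also handle scaling, bulk solvent and resolution bins.

```python
def r_factor(f_obs, f_calc):
    """Conventional R = sum(|Fobs - Fcalc|) / sum(Fobs); amplitudes assumed pre-scaled."""
    return sum(abs(o - c) for o, c in zip(f_obs, f_calc)) / sum(f_obs)


def r_work_and_free(f_obs, f_calc, is_free):
    """Return (Rwork, Rfree, gap) given a per-reflection test-set flag."""
    work_obs = [o for o, f in zip(f_obs, is_free) if not f]
    work_calc = [c for c, f in zip(f_calc, is_free) if not f]
    free_obs = [o for o, f in zip(f_obs, is_free) if f]
    free_calc = [c for c, f in zip(f_calc, is_free) if f]
    r_work = r_factor(work_obs, work_calc)
    r_free = r_factor(free_obs, free_calc)
    return r_work, r_free, r_free - r_work


# Toy amplitudes, for illustration only: the last two reflections are the test set.
f_obs = [100.0, 80.0, 60.0, 40.0, 50.0, 50.0]
f_calc = [95.0, 84.0, 57.0, 42.0, 40.0, 55.0]
is_free = [False, False, False, False, True, True]

r_work, r_free, gap = r_work_and_free(f_obs, f_calc, is_free)
print(f"Rwork={r_work:.3f} Rfree={r_free:.3f} gap={gap:.3f}")
```

The only difference between the two numbers is which reflections enter the sums: the test-set reflections were never used to adjust the model, so the errors in them cannot have been fitted.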
Re: [ccp4bb] over-fitting? over-refinement?
A related way of looking at things is saying that your model is over-fitted when you increase your model's precision without (noticeably) gaining accuracy. Ethan describes the cases in which you add a lot of parameters to your model. The test he describes in his paper works great; I use it all the time in PDB-REDO.

Pavel sent a lot of references on R-free, which is used to test how predictive the model is. If R-free deviates too much from R, then apparently your model is not predictive enough for additional data (so it's too precise or not accurate enough).

If we relate accuracy to R-free and the predictiveness of the model, then where does the precision come from if you do not change the number of model parameters? We always give coordinates and B-factors with the same precision in our models, don't we? Well yes, but we change other aspects of the model's precision by adding restraints (effectively improving the degrees of freedom of the model: Occam's Razor). For instance, we have restraints to reduce the range that bond lengths can have in our model with respect to their "known" standard deviation. Or we reduce the differences between things we expect to be very similar with NCS restraints.

That is why we validate models by looking at scores like the bond length rmsZ: we check whether the model is not too precise overall. If one bond is much longer or shorter than expected, this can still be right. If most of them are, then something is going on and your model may be too precise. The same goes for your Ramachandran plot: a single outlier may not be a problem, but if all residues are off a lot you should worry. That is why we have been advocating the Ramachandran Z-score (also see the recent paper by the Phenix and PDB-REDO teams).

All of these things are related to the degrees of freedom of your system (observations - parameters + some_weight*restraints). Make sure you do not have too many parameters overall, and improve the degrees of freedom by adding restraints. You can balance precision and accuracy by changing the weights on the restraints. And you check the accuracy by looking at the deviation of R and R-free.

HTH,
Robbie

In reply to: Ethan A Merritt, Tuesday, October 20, 2020 06:04
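The two quantities Robbie mentions can be written down in a few lines. The sketch below assumes the usual definition of rmsZ (root mean square of (value - target) / sigma over all restrained bonds) and takes the degrees-of-freedom expression literally as given in the message; the function names, bond values, sigmas and counts are all invented for illustration, and the weight of 0.5 is an arbitrary placeholder.

```python
import math


def rms_z(values, targets, sigmas):
    """Root-mean-square Z-score of model values against restraint targets."""
    z_scores = [(v - t) / s for v, t, s in zip(values, targets, sigmas)]
    return math.sqrt(sum(z * z for z in z_scores) / len(z_scores))


def effective_dof(n_observations, n_parameters, n_restraints, restraint_weight):
    """observations - parameters + some_weight * restraints, as in the message."""
    return n_observations - n_parameters + restraint_weight * n_restraints


# Hypothetical bond lengths (Angstrom), targets and standard deviations.
bond_lengths = [1.53, 1.33, 1.24]
bond_targets = [1.52, 1.33, 1.23]
bond_sigmas = [0.02, 0.01, 0.02]
print(f"bond rmsZ = {rms_z(bond_lengths, bond_targets, bond_sigmas):.2f}")

# Hypothetical counts: 20000 reflections, 16000 parameters, 18000 restraints.
print(effective_dof(20000, 16000, 18000, restraint_weight=0.5))
```

An rmsZ well above 1 means the deviations are on average larger than the tabulated standard deviations allow; raising the restraint weight pulls it down at the cost of a looser fit to the data.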
Re: [ccp4bb] over-fitting? over-refinement?
Hi Sam,

> Hi, the question may be a bit weird, but how do you define 'over-fitting'
> in the context of structure refinement? From users' perspective the
> practical aspect is to 'fit' the model into the density. So there comes
> this question from our juniors: fit is fit, how is a model over-fit?

This is a good question for which there is an answer. I suggest reading classics on the matter:

https://www.nature.com/articles/355472a0
https://atbweb.stanford.edu/atb_publications/brunger_kleywegt_struct_1996.pdf
https://www.sciencedirect.com/science/article/pii/S0076687997770216
https://pubmed.ncbi.nlm.nih.gov/15299543/
https://journals.iucr.org/d/issues/1998/04/00/ad0030/ad0030.pdf

and numerous references therein. That should set the scene for the next questions to ask.

Good luck!
Pavel
Re: [ccp4bb] over-fitting? over-refinement?
On Monday, 19 October 2020 20:27:04 PDT Sam Tang wrote:
> Hi, the question may be a bit weird, but how do you define 'over-fitting'
> in the context of structure refinement? From users' perspective the
> practical aspect is to 'fit' the model into the density. So there comes
> this question from our juniors: fit is fit, how is a model over-fit?

That is a good question, not asked as many times as it should be. There are several validation techniques, tools, and indicators. You are probably at least familiar with Rfree as an indicator. But you are asking the deeper question "what is it that makes it over-fit".

I suggest that the best starting point for thinking about it is Occam's Razor, specifically the rephrasing by Albert Einstein:
"Everything should be made as simple as possible, but no simpler."

In applying this to considering a crystallographic model, that can be translated as "the number of parameters used in the model should be as small as possible, but no smaller".

For example, if your structure is a homo-dimer you have a choice of modelling each monomer separately or modelling only one monomer and then describing how to generate the second by some symmetry operation. Modelling each monomer independently will obviously require twice as many parameters as modelling only one. The guidance from Occam + Einstein is that the simpler (== smaller) model is better, but only if it in fact adequately explains your observations.

Ah, but how do you know if the simpler description is "adequate"? That's where specific statistical tests and quality measures come in. From hundreds of thousands of previous crystal structures we have a good idea of what the R-factor for a good model is expected to be. Does your simple model have a good R-factor? Good enough? If you refine the more complicated (twice as big) model does it have a better R-factor? If not then clearly all those extra parameters are useless and the model is over-fit. More typically the R-factor for the more complicated model will be a little bit better. But "a little bit" is not very convincing. So we need some statistical test to ask if the model is _significantly_ better. I won't delve into statistics here, but that's the philosophical approach.

I wrote a paper some years ago trying to lay this out as clearly as I could while focusing on a common choice made by crystallographers as to how to choose or refine B factors. The logic and statistical approach is valid for many other choices in model refinement. You can find a copy of the paper on my web site:

E.A. Merritt (2012). "To B or not to B: a question of resolution?" Acta Cryst. D68, 468-477.
http://skuld.bmsc.washington.edu/~tlsmd/ActaD_68_468.pdf

cheers,
Ethan

--
Ethan A Merritt
Biomolecular Structure Center, K-428 Health Sciences Bldg
MS 357742, University of Washington, Seattle 98195-7742
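The homo-dimer example can be put into numbers with a simple parameter count. The sketch below assumes four parameters per atom for an isotropic model (x, y, z, B), nine for an anisotropic one (x, y, z plus six ADP terms), ignores occupancies, and describes the operator that generates the second monomer with six parameters (three rotational, three translational). The atom and reflection counts are invented, and the function name is hypothetical.

```python
PARAMS_PER_ATOM = {"isotropic": 4, "anisotropic": 9}  # xyz + B, or xyz + 6 ADP terms


def n_parameters(n_atoms, b_model="isotropic"):
    """Number of refined parameters for n_atoms, occupancies ignored."""
    return n_atoms * PARAMS_PER_ATOM[b_model]


# Hypothetical homo-dimer: 2000 atoms per monomer, 20000 unique reflections.
atoms_per_monomer = 2000
n_reflections = 20000

# Each monomer modelled independently.
independent = n_parameters(2 * atoms_per_monomer)

# One monomer plus a rigid-body operator (3 rotations + 3 translations).
constrained = n_parameters(atoms_per_monomer) + 6

for name, n_par in [("independent", independent), ("constrained", constrained)]:
    print(f"{name}: {n_par} parameters, "
          f"{n_reflections / n_par:.2f} observations per parameter")
```

The count alone does not say which model is right; it only shows how much the observation-to-parameter ratio changes, which is what a significance test on the R-factor improvement then has to justify.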
[ccp4bb] over-fitting? over-refinement?
Hi, the question may be a bit weird, but how do you define 'over-fitting' in the context of structure refinement? From users' perspective the practical aspect is to 'fit' the model into the density. So there comes this question from our juniors: fit is fit, how is a model over-fit?

BRS
Sam