Re: [ccp4bb] over-fitting? over-refinement?

Robbie Joosten Mon, 19 Oct 2020 23:01:01 -0700

A related way of looking at things is to say that your model is over-fitted 
when you increase your model's precision without (noticeably) gaining accuracy.

Ethan describes the cases in which you add a lot of parameters to your model. 
The test he describes in his paper works great; I use it all the time in 
PDB-REDO. Pavel sent a lot of references to R-free, which is used to test how 
predictive the model is. If R-free deviates too much from R, then apparently 
your model is not predictive enough for additional data (so it's too precise or 
not accurate enough). If we relate accuracy to R-free and the predictiveness of 
the model, then where does the precision come from if you do not change the 
number of model parameters? We always give coordinates and B-factors with the 
same precision in our models, don't we? Well yes, but we change other aspects 
of the model's precision by adding restraints (effectively improving the 
degrees of freedom of the model: Occam's Razor). For instance, we have 
restraints to reduce the range that bond lengths can have in our model with 
respect to their "known" standard deviation. Or we reduce the differences 
between things we expect to be very similar with NCS restraints. That is why we 
validate models by looking at scores like the bond length rmsZ: we check 
whether the model is not too precise overall. If one bond is much longer or 
shorter than expected, this can still be right. If most of them are, then 
something is going on and your model may be too precise. The same goes for your 
Ramachandran plot: a single outlier may not be a problem, but if all residues 
are off a lot you should worry. That is why we have been advocating the 
Ramachandran Z-score (also see the recent paper by the Phenix and PDB-REDO 
teams).
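
To make the bond length rmsZ concrete, here is a minimal sketch of how such a 
score can be computed. The function name rms_z and the example numbers are 
made up for illustration; they are not taken from PDB-REDO or any refinement 
program. Each bond gets a Z-score (deviation from the restraint target divided 
by the target's standard deviation), and the rmsZ is the root mean square of 
those Z-scores over all bonds.

```python
import math


def rms_z(observed, ideal, sigma):
    """Root-mean-square Z-score of observed values against restraint targets.

    observed, ideal and sigma are equal-length sequences: the value in the
    model, the restraint target, and the target's standard deviation.
    """
    if not observed or not (len(observed) == len(ideal) == len(sigma)):
        raise ValueError("need three non-empty sequences of equal length")
    z_scores = [(o - i) / s for o, i, s in zip(observed, ideal, sigma)]
    return math.sqrt(sum(z * z for z in z_scores) / len(z_scores))


# Illustrative (hypothetical) bond lengths in Angstrom, not real restraint data.
model_lengths = [1.530, 1.335, 1.240]
target_lengths = [1.525, 1.329, 1.231]
target_sigmas = [0.021, 0.014, 0.020]

print(f"bond length rmsZ = {rms_z(model_lengths, target_lengths, target_sigmas):.2f}")
```

A single large Z-score barely moves the rmsZ when there are thousands of 
bonds, which is why the overall score, rather than any one outlier, is the 
thing to watch.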

All of these things are related to the degrees of freedom of your system 
(observations - parameters + some_weight*restraints). Make sure you do not have 
too many parameters overall, and improve the degrees of freedom by adding 
restraints. You can balance precision and accuracy by changing the weights on 
the restraints. And you check the accuracy by looking at the deviation of R and 
R-free.
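
As a rough illustration of the bookkeeping above, the sketch below evaluates 
observations - parameters + some_weight*restraints and the gap between R-free 
and R. The function names and all numbers are hypothetical, and treating the 
restraint weight as a single scalar is a simplification of what refinement 
programs actually do.

```python
def effective_dof(n_observations, n_parameters, n_restraints, restraint_weight):
    """Degrees of freedom as observations - parameters + weight * restraints."""
    return n_observations - n_parameters + restraint_weight * n_restraints


def r_gap(r_work, r_free):
    """Difference between R-free and R; a large gap suggests over-fitting."""
    return r_free - r_work


# Hypothetical example: 2500 atoms with x, y, z and one isotropic B-factor each.
n_atoms = 2500
n_parameters = 4 * n_atoms      # 10000 parameters
n_observations = 20000          # unique reflections (made-up number)
n_restraints = 9000             # bond, angle, ... restraints (made-up number)

for weight in (0.0, 0.5, 1.0):
    dof = effective_dof(n_observations, n_parameters, n_restraints, weight)
    print(f"restraint weight {weight:.1f}: degrees of freedom = {dof:.0f}")

print(f"R-free - R = {r_gap(0.20, 0.25):.3f}")
```

Raising the restraint weight increases the degrees of freedom without 
touching the number of parameters, which is the balance between precision and 
accuracy described above.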

HTH,
Robbie 

> -----Original Message-----
> From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> On Behalf Of Ethan A
> Merritt
> Sent: Tuesday, October 20, 2020 06:04
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] over-fitting? over-refinement?
> 
> On Monday, 19 October 2020 20:27:04 PDT Sam Tang wrote:
> > Hi, the question may be a bit weird, but how do you define 'over-fitting'
> > in the context of structure refinement? From users' perspective the
> > practical aspect is to 'fit' the model into the density. So there
> > comes this question from our juniors: fit is fit, how is a model over-fit?
> 
> That is a good question, not asked as many times as it should be.
> There are several validation techniques, tools, and indicators.
> You are probably at least familiar with Rfree as an indicator.
> But you are asking the deeper question "what is it that makes it over-fit".
> 
> I suggest that the best starting point for thinking about it is Occam's Razor,
> specifically the rephrasing by Albert Einstein:
>    "Everything should be made as simple as possible,
>     but no simpler."
> 
> In applying this to considering a crystallographic model, that can be
> translated as "the number of parameters used in the model should be as
> small as possible, but no smaller".
> 
> For example, if your structure is a homo-dimer you have a choice of
> modelling each monomer separately or modelling only one monomer and
> then describing how to generate the second by some symmetry operation.
> Modeling each monomer independently will obviously require twice as many
> parameters as modelling only one.
> The guidance from Occam + Einstein is that the simpler (== smaller) model is
> better, but only if it in fact adequately explains your observations.
> 
> Ah, but how do you know if the simpler description is "adequate"?
> That's where specific statistical tests and quality measures come in.
> From hundreds of thousands of previous crystal structures we have a good
> idea of what the R-factor for a good model is expected to be.
> Does your simple model have a good R-factor?   Good enough?
> If you refine the more complicated (twice as big) model does it have a better
> R-factor?  If not then clearly all those extra parameters are useless and the
> model is over-fit.
> More typically the R-factor for the more complicated model will be a little bit
> better.  But "a little bit" is not very convincing.
> So we need some statistical test to ask if the model is
> _significantly_ better.   I won't delve into statistics here,
> but that's the philosophical approach.
> 
> I wrote a paper some years ago trying to lay this out as clearly as I could
> while focusing on a common choice made by crystallographers as to how to
> choose or refine B factors.  The logic and statistical approach is valid for
> many other choices in model refinement.
> 
> You can find a copy of the paper on my web site:
> 
> E.A. Merritt (2012). "To B or not to B: a question of resolution?"
> Acta Cryst. D68, 468-477.
> http://skuld.bmsc.washington.edu/~tlsmd/ActaD_68_468.pdf
> 
>       cheers,
> 
>               Ethan
> >
> > BRS
> >
> > Sam
> 
> --
> Ethan A Merritt
> Biomolecular Structure Center,  K-428 Health Sciences Bldg
> MS 357742,   University of Washington, Seattle 98195-7742
> 
> ########################################################################
> 
> To unsubscribe from the CCP4BB list, click the following link:
> https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1
> 
> This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a
> mailing list hosted by www.jiscmail.ac.uk, terms & conditions are available at
> https://www.jiscmail.ac.uk/policyandsecurity/

