Re: [ccp4bb] over-fitting? over-refinement?

Ethan A Merritt Mon, 19 Oct 2020 21:08:13 -0700

On Monday, 19 October 2020 20:27:04 PDT Sam Tang wrote:
> Hi, the question may be a bit weird, but how do you define 'over-fitting'
> in the context of structure refinement? From users' perspective the
> practical aspect is to 'fit' the model into the density. So there comes
> this question from our juniors: fit is fit, how is a model over-fit?


That is a good question, not asked as many times as it should be.
There are several validation techniques, tools, and indicators.
You are probably at least familiar with Rfree as an indicator.
But you are asking the deeper question "what is it that makes
a model over-fit?"

I suggest that the best starting point for thinking about it is
Occam's Razor, specifically the rephrasing by Albert Einstein:
   "Everything should be made as simple as possible, 
    but no simpler."

Applied to a crystallographic model, that can be translated as
"the number of parameters used in the model should be as small
as possible, but no smaller".

For example, if your structure is a homo-dimer you have a choice
of modelling each monomer separately or modelling only one monomer
and then describing how to generate the second by some symmetry
operation. Modelling each monomer independently will obviously require
twice as many parameters as modelling only one.
The guidance from Occam + Einstein is that the simpler (== smaller)
model is better, but only if it in fact adequately explains your
observations.  
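
To make the parameter counting concrete, here is a rough sketch in
Python. The numbers are assumptions chosen for illustration, not taken
from any refinement program: 4 parameters per atom (x, y, z and one
isotropic B) and 6 parameters for the symmetry operation (3 rotation,
3 translation). The function name is hypothetical.

```python
def parameter_counts(atoms_per_monomer, params_per_atom=4, operator_params=6):
    """Count refined parameters for a homo-dimer under two descriptions.

    Assumed for illustration: each atom carries x, y, z and one isotropic B,
    and the symmetry operation needs 3 rotation + 3 translation parameters.
    """
    # Both monomers modelled independently.
    independent = 2 * atoms_per_monomer * params_per_atom
    # One monomer modelled, the second generated by the symmetry operation.
    constrained = atoms_per_monomer * params_per_atom + operator_params
    return independent, constrained


independent, constrained = parameter_counts(2000)
print(independent, constrained)  # 16000 8006
```

The number of observations is the same in both cases, so the
independent model has to justify roughly twice as many parameters
with the same data.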

Ah, but how do you know if the simpler description is "adequate"?
That's where specific statistical tests and quality measures come in.
From hundreds of thousands of previous crystal structures we have
a good idea of what the R-factor for a good model is expected to be.
Does your simple model have a good R-factor?   Good enough?
If you refine the more complicated (twice as big) model does it
have a better R-factor?  If not then clearly all those extra
parameters are useless and the model is over-fit.
More typically the R-factor for the more complicated model will
be a little bit better.  But "a little bit" is not very convincing.
So we need some statistical test to ask if the model is
_significantly_ better.   I won't delve into statistics here,
but that's the philosophical approach.
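
For readers who want to see the quantity being compared, below is a
minimal sketch of the conventional R-factor,
R = sum| |Fobs| - |Fcalc| | / sum|Fobs|, computed separately for the
reflections used in refinement (Rwork) and for the held-out test set
(Rfree). The amplitudes and flags are made-up numbers, and the function
names are hypothetical. It is not a significance test, only the
bookkeeping such a test would start from.

```python
def r_factor(f_obs, f_calc):
    """Conventional R = sum| |Fobs| - |Fcalc| | / sum|Fobs|."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def r_work_and_r_free(f_obs, f_calc, free_flags):
    """Split reflections by the free flag and return (Rwork, Rfree)."""
    work_obs, work_calc, free_obs, free_calc = [], [], [], []
    for o, c, is_free in zip(f_obs, f_calc, free_flags):
        if is_free:
            free_obs.append(o)
            free_calc.append(c)
        else:
            work_obs.append(o)
            work_calc.append(c)
    return r_factor(work_obs, work_calc), r_factor(free_obs, free_calc)


# Made-up amplitudes, for illustration only.
f_obs = [100.0, 50.0, 80.0, 20.0]
f_calc = [90.0, 55.0, 80.0, 30.0]
free_flags = [False, False, False, True]  # last reflection is held out

r_work, r_free = r_work_and_r_free(f_obs, f_calc, free_flags)
print(f"Rwork = {r_work:.3f}, Rfree = {r_free:.3f}")
```

If adding parameters lowers Rwork but leaves Rfree unchanged or worse,
that is a warning sign that the extra parameters are fitting noise
rather than signal.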

I wrote a paper some years ago trying to lay this out as clearly
as I could, focusing on a common choice crystallographers face:
how to model and refine B factors.  The same logic and statistical
approach apply to many other choices in model refinement.

You can find a copy of the paper on my web site:

E.A. Merritt (2012). "To B or not to B: a question of resolution?" 
Acta Cryst. D68, 468-477.
http://skuld.bmsc.washington.edu/~tlsmd/ActaD_68_468.pdf

        cheers,

                Ethan
> 
> BRS
> 
> Sam

-- 
Ethan A Merritt
Biomolecular Structure Center,  K-428 Health Sciences Bldg
MS 357742,   University of Washington, Seattle 98195-7742
