Re: [ccp4bb] outliers

Dale Tronrud Tue, 08 Nov 2022 23:38:37 -0800

Let's say you have decided that you want to know if the CA-CB bond of residue 123 in your favorite protein differs from the expected value for that type of bond. You solve the structure and refine a model against your crystallographic data, then look at residue 123's CA-CB bond and find that it is 3 sigma from the expected value. Is this observation unlikely given the uncertainties in the parameters of the model?

Now, let's look at a different case. You have solved and refined a model of your favorite protein. After examining all of 1000 bond lengths in your model you notice that the CA-CB bond of residue 123 is 3 sigma from its expected value. Is this observation unlikely given the uncertainties in the parameters of the model?

Even though you are looking at the same bond in the same model and see exactly the same thing, the calculation of the probability that this bond actually differs from the usual value is very different. The calculation that you want to perform - the classic p test based on a Normal distribution - is valid for the first case but is quite inappropriate for the second.

It is clearly much more likely that, among 1000 bonds, one of them will have a deviation of 3 sigma. In fact I would say it is a near certainty.
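The difference between the two cases can be checked with a quick simulation. What follows is only a minimal sketch, assuming every bond deviate is an independent draw from a unit Normal distribution (real restrained bond lengths will not be strictly independent). The function name, trial counts and seed are made up for illustration.

```python
import random

def fraction_with_outlier(n_bonds, threshold=3.0, n_trials=2000, seed=0):
    """Fraction of simulated models in which at least one of n_bonds
    unit-Normal deviates exceeds `threshold` in absolute value.
    Assumption: deviates are independent and Gaussian."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # any() stops at the first outlier found in this simulated model
        if any(abs(rng.gauss(0.0, 1.0)) > threshold for _ in range(n_bonds)):
            hits += 1
    return hits / n_trials

if __name__ == "__main__":
    # Case 1: one bond chosen before looking at the model
    print("one pre-chosen bond:", fraction_with_outlier(1, n_trials=200000))
    # Case 2: the worst bond found after scanning 1000 of them
    print("any of 1000 bonds:  ", fraction_with_outlier(1000))
```

Under these assumptions the first number should come out near 0.003 and the second above 0.9, so the same 3 sigma observation answers two very different questions.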

This twist of statistical analysis was never discussed in the basic classes on stats that I took and most scientists tend to ignore it. To avoid the apparent paradox that you are confronting you have to include in your calculations the consequences of the actual question you have asked.

There are huge problems with calculating this sort of "significance" because it is quite tempting to change your question after the fact and conclude that something is significant when it is not. TNT always produced a list of the geometry outliers after refinement. If you notice that a residue in the active site is present in that list, you will be tempted to forget that this residue was brought to your attention by a search over all geometry restraints and not a prior interest in the active site.

This is a problem that many other fields of research are contending with. One solution is to publish the questions you hope your model will answer before you perform the research. That is certainly difficult with our sort of research.

An example from another area might be helpful. A researcher performs a survey of a lot of people asking questions about their diet and about their medical history. Very often the published conclusion will be that, say, dietary item number 5 is correlated with medical condition number 12. These studies tend to assess the significance of this result by just computing the odds of these two items, taken alone, showing the observed magnitude of correlation by chance.

This ignores the fact that a host of correlations were calculated and only this one was "significant". If the survey had 20 dietary factors and 20 conditions then 400 comparisons were made and it was a virtual certainty that one of them would be "significant" unless the proper correction is made to the probability calculations.
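To put rough numbers on the survey example, here is a small sketch. It assumes the 400 comparisons are independent and that "significant" means p < 0.05; neither is stated above, so treat both as assumptions. Bonferroni and Sidak are two standard corrections, named here as examples and not as the specific correction meant above.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false "significant" result among n_tests
    independent comparisons, each tested at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-comparison threshold that keeps the family-wise rate <= alpha."""
    return alpha / n_tests

def sidak_threshold(n_tests, alpha=0.05):
    """Per-comparison threshold giving exactly alpha family-wise,
    assuming independent comparisons."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

if __name__ == "__main__":
    n = 20 * 20  # 20 dietary factors x 20 conditions
    print("expected false positives:", n * 0.05)
    print("P(at least one):         ", familywise_error_rate(n))
    print("Bonferroni threshold:    ", bonferroni_threshold(n))
    print("Sidak threshold:         ", sidak_threshold(n))
```

With these assumptions about 20 of the 400 comparisons would look "significant" by chance alone, and a single correlation would need a p-value near 0.000125 before it deserved attention.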


Dale E. Tronrud

On 11/8/2022 3:25 PM, James Holton wrote:

Thank you Ian for your quick response!
I suppose what I'm really trying to do is put a p-value on the "geometry" of a given PDB file. As in: what are the odds the deviations from ideality of this model are due to chance?
I am leaning toward the need to take all the deviations in the structure together as a set, but, as Joao just noted, it just "feels wrong" to tolerate a 3-sigma deviate. Even more wrong to tolerate 4 sigma, 5 sigma. And 6 sigma deviates are really difficult to swallow unless you have trillions of data points.
To put it down in equations, is the p-value of a structure with 1000 bonds in it with one 3-sigma deviate given by:
a)  p = 1-erf(3/sqrt(2))
or
b)  p = 1-erf(3/sqrt(2))**1000
or
c) something else?
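For reference, options (a) and (b) can be evaluated directly. This is a sketch only: it reads (b) with Python's operator precedence, where `**` binds more tightly than the minus sign (an assumption about the intended notation), and it treats the 1000 deviates as independent. The function names are made up for illustration.

```python
from math import erf, sqrt

def p_single(z):
    """Option (a): chance that one Gaussian deviate lies beyond z sigma
    (two-sided)."""
    return 1.0 - erf(z / sqrt(2))

def p_at_least_one(z, n):
    """Option (b) as parsed by Python: 1 - (erf(z/sqrt(2)) ** n), the chance
    that at least one of n independent deviates lies beyond z sigma."""
    return 1.0 - erf(z / sqrt(2)) ** n

def p_all_beyond(z, n):
    """The other reading of (b): (1 - erf(z/sqrt(2))) ** n, the chance that
    all n deviates lie beyond z sigma. Underflows to 0.0 for n = 1000."""
    return (1.0 - erf(z / sqrt(2))) ** n

if __name__ == "__main__":
    print("a)                  ", p_single(3.0))
    print("b) at least one     ", p_at_least_one(3.0, 1000))
    print("b) other reading    ", p_all_beyond(3.0, 1000))
```

Option (a) comes out near 0.0027 and option (b), read as above, near 0.93. Neither number by itself settles option (c).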



On 11/8/2022 2:56 PM, Ian Tickle wrote:
Hi James
I don't think it's meaningful to ask whether the deviation of a single bond length (or anything else that's single) from its expected value is significant, since as you say there's always some finite probability that it occurred purely by chance. Statistics can only meaningfully be applied to samples of a 'reasonable' size. I know there are statistics designed for small samples but not for samples of size 1! It's more meaningful to talk about distributions. For example if 1% of the sample contained deviations > 3 sigma when you expected there to be only 0.3%, that is probably significant (but it still has a finite probability of occurring by chance), as would be finding no deviations > 3 sigma (for a reasonably large sample to avoid sampling errors).
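The distribution-level comparison described above can be sketched with a binomial tail. The sample size of 1000 is borrowed from the question in this thread and the 0.27% rate is the two-sided Gaussian fraction beyond 3 sigma; both, along with independence of the deviates, are assumptions for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k outliers among n independent deviates."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def binom_tail_ge(k, n, p):
    """Probability of k or more outliers among n independent deviates."""
    return 1.0 - sum(binom_pmf(j, n, p) for j in range(k))

if __name__ == "__main__":
    n, p = 1000, 0.0027
    # 1% of 1000 deviates beyond 3 sigma means 10 or more outliers
    print("P(10 or more outliers):", binom_tail_ge(10, n, p))
    # no outliers at all
    print("P(zero outliers):      ", binom_pmf(0, n, p))
```

With these numbers, ten or more outliers has a chance well below 0.1%, while zero outliers has a chance of about 7%, so a sample of 1000 may still be on the small side for the "no deviations" case.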
Cheers

-- Ian


On Tue, Nov 8, 2022, 22:22 James Holton <[email protected]> wrote:

    OK, so let's suppose there is this bond in your structure that is
    stretched a bit. Is that for real? Or just a random fluke? Let's say,
    for example, it's a CA-CB bond that is supposed to be 1.529 A long,
    but in your model it's 1.579 A. This is 0.05 A too long. Doesn't seem
    like much, right? But the "sigma" given to such a bond in our geometry
    libraries is 0.016 A. These sigmas are typically derived from a
    database of observed bonds of similar type found in highly accurate
    structures, like small molecules. So, that makes this a 3-sigma
    outlier.
    Assuming the distribution of deviations is Gaussian, that's a pretty
    unlikely thing to happen. You expect 3-sigma deviates to appear less
    than 0.3% of the time. So, is that significant?

    But, then again, there are lots of other bonds in the structure. Let's
    say there are 1000. With that many samplings from a Gaussian
    distribution you generally expect to see a 3-sigma deviate at least
    once. That is, do an "experiment" where you pick 1000 Gaussian-random
    numbers from a distribution with a standard deviation of 1.0. Then,
    look for the largest absolute value over all 1000 trials. Is that one
    > 3 sigma? It probably is. If you do this "experiment" millions of
    times it turns out seeing at least one 3-sigma deviate in 1000 tries
    is very common. Specifically, about 93% of the time. It is rare indeed
    to have every member of a 1000-deviate set all lie within 3 sigmas.
    So, we have gone from one 3-sigma deviate being highly unlikely to
    being a virtual certainty if you look at enough samples.

    So, my question is: is a 3-sigma deviate significant? Is it
    significant only if you have one bond in the structure? What about
    angles? What if you have 500 bonds and 500 angles? Do they count as
    1000 deviates together? Or separately?

    I'm sure the more mathematically inclined out there will have some
    intelligent answers for the rest of us. However, if you are not a
    mathematician, how about a vote? Is a 3-sigma bond length deviation
    significant? Or not?

    Looking forward to both kinds of responses,

    -James Holton
    MAD Scientist

    ########################################################################

    To unsubscribe from the CCP4BB list, click the following link:
    https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1
    <https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1>

    This message was issued to members of www.jiscmail.ac.uk/CCP4BB
    <http://www.jiscmail.ac.uk/CCP4BB>, a mailing list hosted by
    www.jiscmail.ac.uk <http://www.jiscmail.ac.uk>, terms & conditions
    are available at https://www.jiscmail.ac.uk/policyandsecurity/