Dear James,
I hope I am not disrupting this thread, and that you will continue with the 
theoretical aspects. But what is the motivation for such a general question? 
In my crystal structures, I usually have many bonds outside the 3-sigma 
interval, especially in ligands, and not only in low-resolution 
structures. However, they are not reported as bond outliers by our 
validation programs. Would you suggest refinement with tighter geometry 
restraints?
Regarding the last question in your contribution, I believe we should report 
the RMSD together with the extremes in Table 1. I have seen a crystal structure 
where one significant outlier was “masked” by a nearly perfect rest of the model.
Best regards,
Petr

From: CCP4 bulletin board <[email protected]> on behalf of James Holton
Sent: Wednesday, November 9, 2022 12:51 AM
To: [email protected]
Subject: Re: [ccp4bb] outliers

Thank you for this.

Hmmm.

Interesting, and good to know the expected distribution of extreme values.

However, what I'm more worried about is how to evaluate the other 999 points.  
Let's say I'm trying to compare two 1000-member sets (A and B) that both have an 
extreme value of 3 sigma, but the other 999 deviates are all 2 sigma in "A" and 
1 sigma in "B".  Clearly, "B" is better than "A", but how do we quantify that?
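
One possible way to put numbers on the A-versus-B comparison is the root-mean-square of the deviations in sigma units (RMS Z), which uses all 1000 values rather than only the extreme. The sketch below is only an illustration, not a method proposed in this thread; the names rms_z, set_a and set_b are made up for the example. For deviates that really are Gaussian with correctly scaled sigmas, the RMS Z is expected to be close to 1.

```python
import math

def rms_z(z_scores):
    """Root-mean-square of deviations that are already expressed in sigma units."""
    return math.sqrt(sum(z * z for z in z_scores) / len(z_scores))

# Hypothetical sets built from the description in the message:
# both contain one 3-sigma extreme; the other 999 deviates are 2 sigma (A) or 1 sigma (B).
set_a = [2.0] * 999 + [3.0]
set_b = [1.0] * 999 + [3.0]

rmsz_a = rms_z(set_a)
rmsz_b = rms_z(set_b)

print(f"RMS Z of A: {rmsz_a:.3f}")
print(f"RMS Z of B: {rmsz_b:.3f}")
```

Both sets share the same extreme value, yet the RMS Z of A is roughly twice that of B, which matches the intuition that B is the better set.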

On 11/8/2022 3:34 PM, Petrus Zwart wrote:
Hi James,

This is what you need.

https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution

The distribution of a maximum of 1k random variates looks like this, and the 
(fitted by eye) analytical distribution associated with it seems to have a 
decent fit - as expected.
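
For reference, the distribution of the maximum can also be written down exactly before any fitting: if the 1000 variates are independent standard normals, the probability that the largest one is below x is Phi(x)**1000, and the generalized extreme value distribution is the limiting approximation to that. The sketch below is a minimal illustration under that independence assumption; max_cdf and max_quantile are names invented here. It considers the largest positive deviate only, not the largest absolute value.

```python
from statistics import NormalDist

N_VARIATES = 1000            # sample size used in the message
STD_NORMAL = NormalDist()    # mean 0, standard deviation 1

def max_cdf(x, n=N_VARIATES):
    """P(largest of n independent standard normal variates <= x)."""
    return STD_NORMAL.cdf(x) ** n

def max_quantile(p, n=N_VARIATES):
    """Value x for which P(largest of n variates <= x) equals p."""
    return STD_NORMAL.inv_cdf(p ** (1.0 / n))

median_max = max_quantile(0.5)
low, high = max_quantile(0.025), max_quantile(0.975)

print(f"median of the maximum: {median_max:.3f} sigma")
print(f"central 95% range of the maximum: {low:.3f} to {high:.3f} sigma")
```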

[image.png]

The idea of a p-value to judge the quality of a structure is interesting. 
xtriage uses this mechanism to flag suspicious normalized intensities, the idea 
being that in a small dataset it is less likely to see a large E value as 
compared to in a large dataset.
The issue, of course, is that the value of a normalized intensity is bounded 
by the number of atoms, whereas the underlying assumption is that it can be 
infinitely large. I think it is still a decent metric.

P


On Tue, Nov 8, 2022 at 3:25 PM James Holton <[email protected]> wrote:
Thank you Ian for your quick response!

I suppose what I'm really trying to do is put a p-value on the "geometry" of a 
given PDB file.  As in: what are the odds the deviations from ideality of this 
model are due to chance?

I am leaning toward the need to take all the deviations in the structure 
together as a set but, as Joao just noted, it just "feels wrong" to 
tolerate a 3-sigma deviate.  It feels even more wrong to tolerate 4 sigma or 
5 sigma. And 6-sigma deviates are really difficult to swallow unless you have 
trillions of data points.

To put it down in equations, is the p-value of a structure with 1000 bonds in 
it with one 3-sigma deviate given by:

a)  p = 1-erf(3/sqrt(2))
or
b)  p = 1-erf(3/sqrt(2))**1000
or
c) something else?
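
To make options (a) and (b) concrete, they can be evaluated numerically. The sketch below only computes the two expressions as written, assuming 1000 independent Gaussian deviates; it does not settle which one, if either, is the right p-value. The variable names are chosen for the example.

```python
import math

N_BONDS = 1000   # number of bonds in the example
Z = 3.0          # size of the deviate in sigma units

# Probability that a single Gaussian deviate lies within +/- Z sigma.
p_inside = math.erf(Z / math.sqrt(2.0))

# (a) chance that one given deviate falls outside +/- 3 sigma
p_a = 1.0 - p_inside

# (b) chance that at least one of N_BONDS independent deviates falls outside
p_b = 1.0 - p_inside ** N_BONDS

# Expected number of deviates outside +/- 3 sigma among N_BONDS
expected_count = N_BONDS * p_a

print(f"(a) p = {p_a:.5f}")
print(f"(b) p = {p_b:.3f}")
print(f"expected number of 3-sigma deviates: {expected_count:.2f}")
```

Expression (a) comes out near 0.0027 and expression (b) near 0.93, so the two readings of "a 3-sigma deviate" differ by more than two orders of magnitude.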


On 11/8/2022 2:56 PM, Ian Tickle wrote:
Hi James

I don't think it's meaningful to ask whether the deviation of a single bond 
length (or anything else that's single) from its expected value is significant, 
since as you say there's always some finite probability that it occurred purely 
by chance.  Statistics can only meaningfully be applied to samples of a 
'reasonable' size.  I know there are statistics designed for small samples but 
not for samples of size 1!  It's more meaningful to talk about distributions.  
For example if 1% of the sample contained deviations > 3 sigma when you 
expected there to be only 0.3 %, that is probably significant (but it still has 
a finite probability of occurring by chance), as would be finding no deviations 
> 3 sigma (for a reasonably large sample to avoid sampling errors).
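
The two cases in this paragraph can be checked with a binomial calculation. The sketch below assumes a sample of 1000 independent deviates, so that "1% of the sample" means 10 deviates beyond 3 sigma; the sample size and the function name binomial_upper_tail are assumptions made for the example.

```python
import math

def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the lower sum."""
    below = sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k))
    return 1.0 - below

N_SAMPLE = 1000                                  # assumed sample size
P_3SIGMA = math.erfc(3.0 / math.sqrt(2.0))       # two-sided Gaussian tail, about 0.27%

# Case 1: 1% of the sample (10 deviates) lies beyond 3 sigma.
p_ten_or_more = binomial_upper_tail(N_SAMPLE, 10, P_3SIGMA)

# Case 2: no deviate at all lies beyond 3 sigma.
p_none = (1.0 - P_3SIGMA) ** N_SAMPLE

print(f"P(10 or more deviates beyond 3 sigma) = {p_ten_or_more:.2e}")
print(f"P(no deviate beyond 3 sigma)          = {p_none:.3f}")
```

Under these assumptions both outcomes are less likely than not, but seeing 1% beyond 3 sigma is far rarer than seeing none at all.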

Cheers

-- Ian

On Tue, Nov 8, 2022, 22:22 James Holton <[email protected]> wrote:
OK, so let's suppose there is this bond in your structure that is
stretched a bit.  Is that for real? Or just a random fluke?  Let's say
for example it's a CA-CB bond that is supposed to be 1.529 A long, but in
your model it's 1.579 A.  This is 0.05 A too long. Doesn't seem like
much, right? But the "sigma" given to such a bond in our geometry
libraries is 0.016 A.  These sigmas are typically derived from a
database of observed bonds of similar type found in highly accurate
structures, like small molecules. So, that makes this a 3-sigma outlier.
Assuming the distribution of deviations is Gaussian, that's a pretty
unlikely thing to happen. You expect 3-sigma deviates to appear less
than 0.3% of the time.  So, is that significant?
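
The arithmetic in this paragraph can be written out as a short calculation. The sketch below uses the numbers quoted above and assumes a Gaussian distribution with both positive and negative deviations counted; the function names are invented for the example.

```python
import math

IDEAL_LENGTH = 1.529   # A, expected CA-CB bond length quoted in the message
MODEL_LENGTH = 1.579   # A, bond length in the model
SIGMA = 0.016          # A, library sigma for this bond type

def z_score(observed, ideal, sigma):
    """Deviation from the ideal value in units of sigma."""
    return (observed - ideal) / sigma

def two_sided_tail(z):
    """P(|Z| >= |z|) for a standard normal variable Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

z = z_score(MODEL_LENGTH, IDEAL_LENGTH, SIGMA)
p_single = two_sided_tail(z)

print(f"deviation: {z:.3f} sigma")
print(f"chance of a deviation at least this large, either sign: {p_single:.5f}")
```

The deviation works out to slightly more than 3 sigma, so the tail probability is somewhat below the 0.3% quoted for an exact 3-sigma deviate.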

But, then again, there are lots of other bonds in the structure. Let's
say there are 1000. With that many samplings from a Gaussian
distribution you generally expect to see a 3-sigma deviate at least
once.  That is, do an "experiment" where you pick 1000 Gaussian-random
numbers from a distribution with a standard deviation of 1.0. Then, look
for the maximum over all 1000 trials. Is that one > 3 sigma? It probably
is. If you do this "experiment" millions of times it turns out seeing at
least one 3-sigma deviate in 1000 tries is very common. Specifically,
about 93% of the time. It is rare indeed to have every member of a
1000-deviate set all lie within 3 sigmas.  So, we have gone from one
3-sigma deviate being highly unlikely to being a virtual certainty if
you look at enough samples.
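
The "experiment" described above is easy to reproduce. The sketch below is a minimal Monte Carlo version using only the Python standard library; the function name, the seed and the number of trials (2000 rather than millions, to keep it fast) are choices made for the example. Note that the 93% figure corresponds to counting deviates of either sign (|z| > 3); counting only positive deviates beyond +3 sigma gives roughly 74%.

```python
import random

def fraction_with_outlier(n_deviates=1000, n_trials=2000, threshold=3.0,
                          two_sided=True, seed=0):
    """Fraction of trials in which at least one Gaussian deviate exceeds the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Draw up to n_deviates standard normal numbers; any() stops at the first outlier.
        if two_sided:
            found = any(abs(rng.gauss(0.0, 1.0)) > threshold for _ in range(n_deviates))
        else:
            found = any(rng.gauss(0.0, 1.0) > threshold for _ in range(n_deviates))
        if found:
            hits += 1
    return hits / n_trials

frac_two_sided = fraction_with_outlier(two_sided=True)
frac_one_sided = fraction_with_outlier(two_sided=False)

print(f"at least one |z| > 3 in 1000 deviates: {frac_two_sided:.3f}")
print(f"at least one z > +3 in 1000 deviates:  {frac_one_sided:.3f}")
```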

So, my question is: is a 3-sigma deviate significant?  Is it significant
only if you have one bond in the structure?  What about angles? What if
you have 500 bonds and 500 angles?  Do they count as 1000 deviates
together? Or separately?

I'm sure the more mathematically inclined out there will have some
intelligent answers for the rest of us, however, if you are not a
mathematician, how about a vote?  Is a 3-sigma bond length deviation
significant? Or not?

Looking forward to both kinds of responses,

-James Holton
MAD Scientist

########################################################################

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list 
hosted by www.jiscmail.ac.uk, terms & conditions are available at 
https://www.jiscmail.ac.uk/policyandsecurity/



--
------------------------------------------------------------------------------------------
P.H. Zwart
Staff Scientist, Molecular Biophysics and Integrated Bioimaging
Biosciences Lead, Center for Advanced Mathematics for Energy Research 
Applications
Lawrence Berkeley National Laboratories
1 Cyclotron Road, Berkeley, CA-94703, USA
Cell: 510 289 9246
PHENIX:   http://www.phenix-online.org/
CAMERA: http://camera.lbl.gov/
------------------------------------------------------------------------------------------

