Dear James,

I hope I am not disrupting this thread, and that the discussion of the theoretical aspects will continue. But what is the motivation for such a general question? In my crystal structures, I usually have many bonds outside the 3-sigma interval, especially in ligands, and not only in low-resolution structures. However, they are not reported as bond outliers by our validation programs. Would you suggest refinement with tighter geometry restraints?

Regarding the last question in your contribution, I believe we should report the RMSD together with the extremes in Table 1. I have seen a crystal structure where one significant outlier was "masked" by a nearly perfect rest of the model.

Best regards,
Petr
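To illustrate the masking effect Petr describes, here is a minimal sketch with made-up numbers (not taken from any real structure): 999 bonds that each deviate by 0.5 sigma and one bond that deviates by 6 sigma. The helper name `rms_z` and all values are assumptions chosen only for illustration.

```python
import math

def rms_z(z_scores):
    """Root-mean-square of deviations expressed in units of sigma."""
    return math.sqrt(sum(z * z for z in z_scores) / len(z_scores))

# Hypothetical model: 999 well-behaved bonds plus a single 6-sigma outlier.
z_scores = [0.5] * 999 + [6.0]

overall = rms_z(z_scores)               # summary statistic over all bonds
worst = max(abs(z) for z in z_scores)   # the extreme value

print(f"RMS Z = {overall:.3f}, largest |Z| = {worst:.1f}")
```

In this toy case the RMS value stays well below 1 even though one bond is 6 sigma off, which is why reporting the extreme alongside the RMSD adds information.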
From: CCP4 bulletin board <[email protected]> on behalf of James Holton
Sent: Wednesday, November 9, 2022 12:51 AM
To: [email protected]
Subject: Re: [ccp4bb] outliers

Thank you for this. Hmmm. Interesting, and good to know the expected distribution of extreme values. However, what I'm more worried about is how to evaluate the other 999 points. Let's say I'm trying to compare two 1000-member sets (A and B) that both have an extreme value of 3 sigma, but the other 999 are all at 2 sigma in "A" and 1 sigma in "B". Clearly, "B" is better than "A", but how to quantify?

On 11/8/2022 3:34 PM, Petrus Zwart wrote:

Hi James,

This is what you need:
https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution

The distribution of the maximum of 1k random variates looks like this, and the (fitted by eye) analytical distribution associated with it seems to have a decent fit, as expected.

[image.png]

The idea of a p-value to judge the quality of a structure is interesting. xtriage uses this mechanism to flag suspicious normalized intensities, the idea being that a large E value is less likely to be seen in a small dataset than in a large one. The issue, of course, is that a normalized intensity is bounded by the number of atoms, whereas the underlying assumption is that it can potentially be infinitely large. It is still a decent metric, I think.

P

On Tue, Nov 8, 2022 at 3:25 PM James Holton <[email protected]> wrote:

Thank you Ian for your quick response!

I suppose what I'm really trying to do is put a p-value on the "geometry" of a given PDB file. As in: what are the odds the deviations from ideality of this model are due to chance? I am leaning toward the need to take all the deviations in the structure together as a set, but, as Joao just noted, it just "feels wrong" to tolerate a 3-sigma deviate. Even more wrong to tolerate 4 sigma, 5 sigma.
And 6-sigma deviates are really difficult to swallow unless you have trillions of data points.

To put it down in equations, is the p-value of a structure with 1000 bonds in it with one 3-sigma deviate given by:

a) p = 1-erf(3/sqrt(2))
or
b) p = 1-erf(3/sqrt(2))**1000
or
c) something else?

On 11/8/2022 2:56 PM, Ian Tickle wrote:

Hi James

I don't think it's meaningful to ask whether the deviation of a single bond length (or anything else that's single) from its expected value is significant, since as you say there's always some finite probability that it occurred purely by chance. Statistics can only meaningfully be applied to samples of a 'reasonable' size. I know there are statistics designed for small samples, but not for samples of size 1!

It's more meaningful to talk about distributions. For example, if 1% of the sample contained deviations > 3 sigma when you expected there to be only 0.3%, that is probably significant (but it still has a finite probability of occurring by chance), as would be finding no deviations > 3 sigma (for a reasonably large sample, to avoid sampling errors).

Cheers

-- Ian

On Tue, Nov 8, 2022, 22:22 James Holton <[email protected]> wrote:

OK, so let's suppose there is this bond in your structure that is stretched a bit. Is that for real? Or just a random fluke? Let's say, for example, it's a CA-CB bond that is supposed to be 1.529 A long, but in your model it's 1.579 A. This is 0.05 A too long. Doesn't seem like much, right? But the "sigma" given to such a bond in our geometry libraries is 0.016 A. These sigmas are typically derived from a database of observed bonds of similar type found in highly accurate structures, like small molecules. So, that makes this a 3-sigma outlier (0.05/0.016 is about 3.1). Assuming the distribution of deviations is Gaussian, that's a pretty unlikely thing to happen. You expect 3-sigma deviates to appear less than 0.3% of the time. So, is that significant?
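The arithmetic in the CA-CB example can be checked with a short sketch. It assumes a Gaussian error model and counts deviations of either sign (a two-sided tail); the function names `z_score` and `two_sided_tail` are made up for this illustration and do not come from any refinement or validation package.

```python
import math

def z_score(observed, ideal, sigma):
    """Deviation from the ideal value in units of the library sigma."""
    return (observed - ideal) / sigma

def two_sided_tail(z):
    """Probability that a single Gaussian deviate has |x| > z."""
    return 1.0 - math.erf(z / math.sqrt(2.0))

# Numbers quoted in the message: ideal 1.529 A, model 1.579 A, sigma 0.016 A.
z = z_score(1.579, 1.529, 0.016)

print(f"Z = {z:.3f}")
print(f"P(|x| > 3) for one deviate = {two_sided_tail(3.0):.4%}")
```

For a single deviate this is option (a) from the later message in the thread; it says nothing yet about how many bonds were examined.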
But, then again, there are lots of other bonds in the structure. Let's say there are 1000. With that many samplings from a Gaussian distribution you generally expect to see a 3-sigma deviate at least once. That is, do an "experiment" where you pick 1000 Gaussian-random numbers from a distribution with a standard deviation of 1.0. Then, look for the maximum over all 1000 trials. Is that one > 3 sigma? It probably is. If you do this "experiment" millions of times, it turns out that seeing at least one 3-sigma deviate in 1000 tries is very common. Specifically, about 93% of the time. It is rare indeed to have every member of a 1000-deviate set lie within 3 sigmas. So, we have gone from one 3-sigma deviate being highly unlikely to being a virtual certainty if you look at enough samples.

So, my question is: is a 3-sigma deviate significant? Is it significant only if you have one bond in the structure? What about angles? What if you have 500 bonds and 500 angles? Do they count as 1000 deviates together? Or separately?

I'm sure the more mathematically inclined out there will have some intelligent answers for the rest of us. However, if you are not a mathematician, how about a vote? Is a 3-sigma bond length deviation significant? Or not?
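The "experiment" described above can be sketched as follows. This is a minimal illustration, not the script James actually ran: it assumes independent standard-normal deviates and counts deviations of either sign (|x| > 3), which is the convention that reproduces the roughly 93% figure. It uses far fewer repeats than "millions" so that it finishes quickly, and the function names are invented for this example.

```python
import math
import random

def analytic_fraction(z, n):
    """P(at least one of n independent Gaussian deviates has |x| > z)."""
    p_inside = math.erf(z / math.sqrt(2.0))  # P(|x| <= z) for one deviate
    return 1.0 - p_inside ** n

def simulated_fraction(z, n, trials, seed=0):
    """Fraction of trials whose largest |deviate| among n draws exceeds z."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        largest = max(abs(rng.gauss(0.0, 1.0)) for _ in range(n))
        if largest > z:
            hits += 1
    return hits / trials

analytic = analytic_fraction(3.0, 1000)
simulated = simulated_fraction(3.0, 1000, trials=2000)

print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

The analytic expression is the same as option (b) in the later message, 1-erf(3/sqrt(2))**1000, read with `**` binding tighter than the subtraction. If only positive deviations were counted, the fraction would be noticeably lower, so the sign convention matters when quoting such numbers.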
Looking forward to both kinds of responses,

-James Holton
MAD Scientist

########################################################################

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list hosted by www.jiscmail.ac.uk, terms & conditions are available at https://www.jiscmail.ac.uk/policyandsecurity/

--
------------------------------------------------------------------------------------------
P.H. Zwart
Staff Scientist, Molecular Biophysics and Integrated Bioimaging
Biosciences Lead, Center for Advanced Mathematics for Energy Research Applications
Lawrence Berkeley National Laboratories
1 Cyclotron Road, Berkeley, CA-94703, USA
Cell: 510 289 9246
PHENIX: http://www.phenix-online.org
CAMERA: http://camera.lbl.gov/
------------------------------------------------------------------------------------------
