I generally cut off integration at the shell where I/sigI < 0.5 and then cut off merged data where MnI/sd(I) ~ 1.5. It is always easier to cut off data later than to re-integrate it. I never look at Rmerge, Rsym, Rpim, or R-whatever in the highest resolution shell, because R-statistics are inappropriate for weak data.

Don't believe me? If anybody out there doesn't think that spots with no intensity are important, then why are you looking so carefully at your systematic absences? ;) The "Rabsent" statistic (if it existed) would always be dividing by zero, and giving wildly varying numbers > 100% (unless your "absences" really do have intensity, but then they are not absent, are they?).

There is information in the intensity of an "absent" spot (a systematic absence, or any spot beyond your "true resolution limit"). Unfortunately, measuring zero is "hard" because the signal-to-noise ratio will always be ... zero. Statistics as we know it seems to fear this noise > signal domain. For example, the error propagation Ulrich pointed out (F/sigF = 2 I/sigI) breaks down as I approaches zero. If you take F = 0, add random noise to it, and then square it, you will get an average value <I> = <F^2> that always equals the square of the noise you added. It will never be zero, no matter how much averaging you do. Going the other way is problematic too: if <I> really is zero, then half of your measurements of it will be negative (and sqrt(I) will be "imaginary" (ha ha)). This is the problem TRUNCATE tries to solve.
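You can see both effects in a one-minute numerical sketch (the noise level here is arbitrary; this is just to illustrate the point, not anything from a real detector):

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 3.0   # arbitrary noise level for the illustration
F_true = 0.0  # an "absent" reflection: the true amplitude really is zero

# Measure F = 0 + noise a million times, then square to get I.
F_obs = F_true + rng.normal(0.0, sigma, size=1_000_000)
I_obs = F_obs ** 2

# <I> = <F^2> converges to sigma^2 (here ~9), never to zero,
# no matter how much averaging you do.
print(np.mean(I_obs))  # ~ 9.0

# Going the other way: if the true intensity is zero and you measure
# I directly, half of the measurements come out negative.
I_direct = rng.normal(0.0, sigma, size=1_000_000)
print(np.mean(I_direct < 0))  # ~ 0.5
```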

Despite these difficulties, IMHO, cutting out weak data from a ML refinement is a really bad idea. This is because there is a big difference between "1 +/- 10" and "I don't know, could be anything" when you are fitting a model to data, ESPECIALLY when your data/parameters ratio is already ~1.0 or less. The DIFFERENCE between Fobs and Fcalc relative to the uncertainty of Fobs is what determines whether or not your model is correct "to within experimental error". If weak, high-res data are left out, then they can become a dumping ground for model bias. Indeed, there are some entries in the PDB (particularly those pre-dating when we knew how to restrain B factors properly) that show an up-turn in "intensity" beyond the quoted resolution cutoff (if you look at the Wilson plot of Fcalc). This is because the refinement program was allowed to make Fcalc beyond the resolution cutoff anything it wanted (and it did).

The only time I think cutting out data because it is weak is appropriate is for map calculations. Leaving out an HKL from the map is the same as assigning it to zero (unless it is a sigma-a map that "fills in" with Fcalcs). In maps, weak data (I/sd < 1) will (by definition) add more noise than signal. In fact, calculating an anomalous difference Patterson with DANO/SIGDANO as the coefficients instead of DANO can often lead to "better" maps.
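The "leaving out an HKL is the same as assigning it to zero" point is just a property of Fourier synthesis, and a toy 1-D example makes it concrete (the coefficients below are made up for illustration):

```python
import numpy as np

# A toy 1-D "density" built from a few made-up Fourier coefficients.
x = np.linspace(0, 1, 200, endpoint=False)
coeffs = {1: 5.0, 2: 3.0, 3: 1.0}  # hypothetical h: F pairs

def synthesize(coeffs, x):
    """Simple cosine Fourier synthesis from an h -> F dictionary."""
    rho = np.zeros_like(x)
    for h, F in coeffs.items():
        rho += F * np.cos(2 * np.pi * h * x)
    return rho

full = synthesize(coeffs, x)

# "Leaving out" the h=3 term from the summation ...
omitted = synthesize({h: F for h, F in coeffs.items() if h != 3}, x)

# ... is numerically identical to including it with F = 0.
zeroed = synthesize({**coeffs, 3: 0.0}, x)

print(np.allclose(omitted, zeroed))  # True: omitting == setting to zero
```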

Yes, your Rmerge, Rcryst and Rfree will all go up if you include weak data in your scaling and refinement, but the accuracy of your model will improve. If you (or your reviewer) are worried about this, I suggest using the old, traditional 3-sigma cutoff for data used to calculate R. Keep the anachronisms together. Yes, the PDB allows this. In fact, (last time I checked) you are asked to enter what sigma cutoff you used for your R factors.
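If you want to report R both ways, the bookkeeping is trivial. Here is a sketch with synthetic numbers (the function name and the noise model are mine, purely for illustration): refine against everything, but quote R over the subset with I/sigI above the traditional cutoff.

```python
import numpy as np

def r_factor(F_obs, F_calc, I_over_sig, sigma_cut=None):
    """Conventional R = sum|Fo - Fc| / sum|Fo|, optionally restricted
    to reflections above a sigma cutoff (the old 3-sigma convention)."""
    keep = np.ones_like(F_obs, dtype=bool)
    if sigma_cut is not None:
        keep = I_over_sig > sigma_cut
    return np.sum(np.abs(F_obs[keep] - F_calc[keep])) / np.sum(np.abs(F_obs[keep]))

# Synthetic "data": weaker reflections get noisier Fobs.
rng = np.random.default_rng(1)
F_calc = rng.uniform(1.0, 100.0, 5000)
I_over_sig = rng.exponential(3.0, 5000)
F_obs = F_calc + rng.normal(0.0, 1.0, 5000) * (10.0 / (1.0 + I_over_sig))

r_all = r_factor(F_obs, F_calc, I_over_sig)               # all data
r_cut = r_factor(F_obs, F_calc, I_over_sig, sigma_cut=3)  # 3-sigma subset
print(r_all, r_cut)  # including the weak data gives the larger R
```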

In the last 100 days (3750 PDB depositions), the "REMARK 3 DATA CUTOFF" stats are thus:

sigma-cutoff    popularity
NULL               13.84%
NONE               13.65%
-2.5 to -1.5        0.37%
-0.5 to  0.5       62.48%
 0.5 to  1.5        2.03%
 1.5 to  2.5        6.51%
 2.5 to  3.5        0.61%
 3.5 to  4.5        0.24%
>4.5                0.27%

So it would appear mine is not a popular attitude.

-James Holton
MAD Scientist


Shane Atwell wrote:

Could someone point me to some standards for data quality, especially for publishing structures? I'm wondering in particular about highest shell completeness, multiplicity, sigma and Rmerge.

A co-worker pointed me to a '97 article by Kleywegt and Jones:

http://xray.bmc.uu.se/gerard/gmrp/gmrp.html

"To decide at which shell to cut off the resolution, we nowadays tend to use the following criteria for the highest shell: completeness > 80 %, multiplicity > 2, more than 60 % of the reflections with I > 3 sigma(I), and Rmerge < 40 %. In our opinion, it is better to have a good 1.8 Å structure, than a poor 1.637 Å structure."

Are these recommendations still valid with maximum likelihood methods? We tend to use more data, especially in terms of the Rmerge and sigma cutoff.

Thanks in advance,

*Shane Atwell*
