I generally cut off integration at the shell where I/sigI < 0.5 and then cut off merged data where MnI/sd(I) ~ 1.5. It is always easier to cut off data later than to re-integrate it. I never look at Rmerge, Rsym, Rpim, or R-whatever in the highest resolution shell, because R-statistics are inappropriate for weak data.

Don't believe me? If anybody out there doesn't think that spots with no intensity are important, then why are you looking so carefully at your systematic absences? ;) The "Rabsent" statistic (if it existed) would always be dividing by zero, and giving wildly varying numbers > 100% (unless your "absences" really do have intensity, but then they are not absent, are they?).

There is information in the intensity of an "absent" spot (a systematic absence, or any spot beyond your "true resolution limit"). Unfortunately, measuring zero is "hard" because the signal-to-noise ratio will always be ... zero. Statistics as we know it seems to fear this noise > signal domain. For example, the error propagation Ulrich pointed out (F/sigF = 2 I/sigI) breaks down as I approaches zero. If you take F = 0, add random noise to it, and then square it, you will get an average value <I> = <F^2> that always equals the square of the noise you added. It will never be zero, no matter how much averaging you do. Going the other way is problematic too: if <I> really is zero, then half of your measurements of it will be negative (and sqrt(I) will be "imaginary" (ha ha)). This is the problem TRUNCATE tries to solve.
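You can see both effects in a one-minute numerical sketch (the noise level here is arbitrary; this is just to illustrate the point, not anything from a real detector):

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 3.0   # arbitrary noise level for the illustration
F_true = 0.0  # an "absent" reflection: the true amplitude really is zero

# Measure F = 0 + noise a million times, then square to get I.
F_obs = F_true + rng.normal(0.0, sigma, size=1_000_000)
I_obs = F_obs ** 2

# <I> = <F^2> converges to sigma^2 (here ~9), never to zero,
# no matter how much averaging you do.
print(np.mean(I_obs))  # ~ 9.0

# Going the other way: if the true intensity is zero and you measure
# I directly, half of the measurements come out negative.
I_direct = rng.normal(0.0, sigma, size=1_000_000)
print(np.mean(I_direct < 0))  # ~ 0.5
```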

Despite these difficulties, IMHO, cutting out weak data from a ML refinement is a really bad idea. This is because there is a big difference between "1 +/- 10" and "I don't know, could be anything" when you are fitting a model to data, ESPECIALLY when your data/parameters ratio is already ~1.0 or less. The DIFFERENCE between Fobs and Fcalc relative to the uncertainty of Fobs is what determines whether or not your model is correct "to within experimental error". If weak, high-res data are left out, then they can become a dumping ground for model bias. Indeed, there are some entries in the PDB (particularly those pre-dating when we knew how to restrain B factors properly) that show an up-turn in "intensity" beyond the quoted resolution cutoff (if you look at the Wilson plot of Fcalc). This is because the refinement program was allowed to make Fcalc beyond the resolution cutoff anything it wanted (and it did).

The only time I think cutting out data because it is weak is appropriate is for map calculations. Leaving out an HKL from the map is the same as assigning it to zero (unless it is a sigma-a map that "fills in" with Fcalcs). In maps, weak data (I/sd < 1) will (by definition) add more noise than signal. In fact, calculating an anomalous difference Patterson with DANO/SIGDANO as the coefficients instead of DANO can often lead to "better" maps.
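The "leaving out an HKL is the same as assigning it to zero" point is just a property of Fourier synthesis, and a toy 1-D example makes it concrete (the coefficients below are made up for illustration):

```python
import numpy as np

# A toy 1-D "density" built from a few made-up Fourier coefficients.
x = np.linspace(0, 1, 200, endpoint=False)
coeffs = {1: 5.0, 2: 3.0, 3: 1.0}  # hypothetical h: F pairs

def synthesize(coeffs, x):
    """Simple cosine Fourier synthesis from an h -> F dictionary."""
    rho = np.zeros_like(x)
    for h, F in coeffs.items():
        rho += F * np.cos(2 * np.pi * h * x)
    return rho

full = synthesize(coeffs, x)

# "Leaving out" the h=3 term from the summation ...
omitted = synthesize({h: F for h, F in coeffs.items() if h != 3}, x)

# ... is numerically identical to including it with F = 0.
zeroed = synthesize({**coeffs, 3: 0.0}, x)

print(np.allclose(omitted, zeroed))  # True: omitting == setting to zero
```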

Yes, your Rmerge, Rcryst and Rfree will all go up if you include weak data in your scaling and refinement, but the accuracy of your model will improve. If you (or your reviewer) are worried about this, I suggest using the old, traditional 3-sigma cutoff for data used to calculate R. Keep the anachronisms together. Yes, the PDB allows this. In fact, (last time I checked) you are asked to enter what sigma cutoff you used for your R factors.
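If you want to report R both ways, the bookkeeping is trivial. Here is a sketch with synthetic numbers (the function name and the noise model are mine, purely for illustration): refine against everything, but quote R over the subset with I/sigI above the traditional cutoff.

```python
import numpy as np

def r_factor(F_obs, F_calc, I_over_sig, sigma_cut=None):
    """Conventional R = sum|Fo - Fc| / sum|Fo|, optionally restricted
    to reflections above a sigma cutoff (the old 3-sigma convention)."""
    keep = np.ones_like(F_obs, dtype=bool)
    if sigma_cut is not None:
        keep = I_over_sig > sigma_cut
    return np.sum(np.abs(F_obs[keep] - F_calc[keep])) / np.sum(np.abs(F_obs[keep]))

# Synthetic "data": weaker reflections get noisier Fobs.
rng = np.random.default_rng(1)
F_calc = rng.uniform(1.0, 100.0, 5000)
I_over_sig = rng.exponential(3.0, 5000)
F_obs = F_calc + rng.normal(0.0, 1.0, 5000) * (10.0 / (1.0 + I_over_sig))

r_all = r_factor(F_obs, F_calc, I_over_sig)               # all data
r_cut = r_factor(F_obs, F_calc, I_over_sig, sigma_cut=3)  # 3-sigma subset
print(r_all, r_cut)  # including the weak data gives the larger R
```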

In the last 100 days (3750 PDB depositions), the "REMARK 3 DATA CUTOFF" stats are thus:

sigma-cutoff    popularity
NULL               13.84%
NONE               13.65%
-2.5 to -1.5        0.37%
-0.5 to  0.5       62.48%
 0.5 to  1.5        2.03%
 1.5 to  2.5        6.51%
 2.5 to  3.5        0.61%
 3.5 to  4.5        0.24%
>4.5                0.27%

So it would appear mine is not a popular attitude.

-James Holton
MAD Scientist


Shane Atwell wrote:

Could someone point me to some standards for data quality, especially for publishing structures? I'm wondering in particular about highest shell completeness, multiplicity, sigma and Rmerge.

A co-worker pointed me to a '97 article by Kleywegt and Jones:

http://xray.bmc.uu.se/gerard/gmrp/gmrp.html

"To decide at which shell to cut off the resolution, we nowadays tend to use the following criteria for the highest shell: completeness > 80 %, multiplicity > 2, more than 60 % of the reflections with I > 3 sigma(I), and Rmerge < 40 %. In our opinion, it is better to have a good 1.8 Å structure, than a poor 1.637 Å structure."

Are these recommendations still valid with maximum likelihood methods? We tend to use more data, especially in terms of the Rmerge and sigma cutoff.

Thanks in advance,

*Shane Atwell*
