Actually, if I/sd < 3, Rmerge, Rpim, Rrim, etc. are all infinity.
Doesn't matter what your redundancy is.
Don't believe me? Try it.
The extreme case is I/sd = 0, and as long as there is some background
(and, let's face it, there always is), the "observed" spot intensity
will be equally likely to be positive or negative, with a (basically)
Gaussian distribution.
So, if you generate say, ten Gaussian-random numbers (centered on zero),
take their average value <I>, compute the average deviation from that
average <|I-<I>|>, and then divide <|I-<I>|>/<I>, you will get the
"Rmerge" expected for I/sd = 0 at a redundancy of 10. Problem is, if
you do this again with a different random number seed, you will get a
very different Rmerge. Even if you do it with a million different
random number seeds and compute the "average Rmerge", you will always
get wildly different values. Some positive, some negative. And it
doesn't matter how many "data points" you use to compute the Rmerge:
averaging a million Rmerge values will give a different answer than
averaging a million and one.
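If you want to try it yourself, here is a minimal sketch of that
experiment in Python (assuming numpy is available; the function name
fake_rmerge is just for illustration):

import numpy as np

def fake_rmerge(n_trials, redundancy=10, true_I=0.0, sigma=1.0, seed=0):
    # single-reflection "Rmerge" = <|I-<I>|>/<I> for one reflection
    # measured `redundancy` times, repeated n_trials times
    rng = np.random.default_rng(seed)
    I = rng.normal(true_I, sigma, size=(n_trials, redundancy))  # "observed" intensities
    mean_I = I.mean(axis=1)                                     # <I> for each trial
    return np.mean(np.abs(I - mean_I[:, None]), axis=1) / mean_I

# five independent tries at I/sd = 0, redundancy 10: wildly different answers
print(fake_rmerge(5, seed=1))
# averaging a million of them with one seed disagrees with another seed
print(fake_rmerge(1_000_000, seed=2).mean())
print(fake_rmerge(1_000_000, seed=3).mean())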
The reason for this numerical instability is that the denominator <I>
follows a Gaussian distribution centered at zero, and the ratio of
anything to a number like that behaves like a Lorentzian (Cauchy)
distribution. The Lorentzian looks a lot like a Gaussian, but has much
fatter tails. Fat enough that the Lorentzian distribution has NO MEAN
VALUE.
Seriously. It is hard to believe that the average value of something
that is equally likely to be positive or negative could be anything but
zero, but for all practical purposes you can never arrive at the average
value of something with a Lorentzian distribution. At least not by
taking finite samples. So, no matter what the redundancy, you will
always get a different Rmerge.
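A quick way to convince yourself of this (again a sketch, assuming
numpy) is to watch the running mean of a pile of Lorentzian (Cauchy)
samples; it never settles down, no matter how many samples you take:

import numpy as np

rng = np.random.default_rng(0)
# the ratio of two independent zero-mean Gaussians is a standard Cauchy
# (Lorentzian) variate
x = rng.normal(size=1_000_000) / rng.normal(size=1_000_000)
for n in (10**3, 10**4, 10**5, 10**6):
    print(f"mean of first {n:>7d} samples: {x[:n].mean():10.3f}")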
However, if <I> is not centered on zero (I/sd > 0), then the ratio
starts to look like a Gaussian itself, and that distribution does have
a mean value (Rmerge will be "reproducible"). This does not happen all
at once, though. The tails start to shrink at I/sd = 1, they are even
smaller at I/sd = 2, and the distribution finally loses all "Lorentzian
character" when I/sd >= 3. Only then is Rmerge a meaningful quantity.
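To watch the transition happen, one can re-run the simulation above
while scanning the signal-to-noise ratio. This sketch takes "I/sd" to
mean the merged <I>/sd(<I>), since that denominator wandering across
zero is what causes the trouble (per-observation sigma fixed at 1;
numpy assumed):

import numpy as np

def rmerge_percentiles(merged_i_over_sd, n_trials=200_000, redundancy=10, seed=0):
    # 5th/95th percentiles of the single-reflection "Rmerge" when the
    # merged <I> has the given signal-to-noise ratio
    rng = np.random.default_rng(seed)
    true_I = merged_i_over_sd / np.sqrt(redundancy)  # sd(<I>) = 1/sqrt(redundancy)
    I = rng.normal(true_I, 1.0, size=(n_trials, redundancy))
    mean_I = I.mean(axis=1)
    r = np.mean(np.abs(I - mean_I[:, None]), axis=1) / mean_I
    return np.percentile(r, [5, 95])

for snr in (0, 1, 2, 3, 5):
    lo, hi = rmerge_percentiles(snr)
    print(f"<I>/sd = {snr}:  Rmerge 5th-95th percentile = {lo:9.2f} .. {hi:9.2f}")

The spread collapses from "anything at all" near zero signal to a
narrow, reproducible band once the merged <I>/sd gets up around 3.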
So, perhaps our "forefathers" who first instituted the practice of a
3-sigma cutoff for all intensities actually DID know what they were
doing! All R statistics (including Rcryst and Rfree) are unstable in
this way for weak data, but sometime in the early 1990s the practice of
computing R-factors on "all data" crept into the field. I'm not saying
we should not use all data: maximum-likelihood refinement uses sigmas
properly, and "weak" data are powerful restraints. However, I will go on
record as suggesting that a 3-sigma cutoff should be used for all R
statistics. There is still a place in your PDB file to put the sigma
cutoff you used for your R factors.
-James Holton
MAD Scientist
Lijun Liu wrote:
Hi Frank,
This is off the original topic, but important to clarify. If I have
misled anyone on the concepts, I apologize.
Outer shell Rmerge will always be very high:
----------
True! Especially when I/Sig ~ 1 or less.
Only I/sigI (and completeness, although it's related) is really
relevant for deciding high resolution cutoff.
---------
Normally I use I/Sig = 2.0 for the resolution cutoff. For this
"accuracy" (please do not ask me the exact meaning of Sig; too many
things contribute to it, including hardware, software, protocol,
strategy, ...), the average measuring error for the reflections can be
expected to be the inverse of this number, 1/2.0, i.e. 50%, which in
general suggests that Rmerge should not exceed this value by much if
including the data is to be meaningful. (Please read this carefully,
since I do not want to confuse two different concepts.) Otherwise you
are merging data with a merging error much larger than the measuring
error of the data. Although the estimation of Sig(I) is difficult and
Sig(I) itself may carry a large error, when I/sig ~ 3 an Rmerge of 70%
still seems too high to accept.
Rmerge is well known to be a weak indicator, but it is not just a
mathematical issue, and it is not simply crap. It should be used
together with other indicators (I/SigI, redundancy, ...). I agree with
Ian that all data should be included, if the quality is guaranteed.
I have not combed through the history of refinement software and its
philosophy, but today it seems all the prevailing refinement packages
use resolution bins for shelling (I know there are sufficient
theoretical grounds to do so), which is the source of the RESOLUTION
CUTOFF and of some problems arising from it, for example the Rmerge
issue. I would appreciate being told whether any software has ever used
I, I/SigI, F, F/SigF or something else for binning, especially in the
early days of refinement package development. RESOLUTION BINNING might
not be a must? :D
Best regards.
Lijun Liu, PhD
http://www.uoregon.edu/~liulj/