Re: [ccp4bb] Rmergicide Through Programming

James Holton Sat, 08 Jul 2017 09:03:43 -0700

Sorry for the confusion.  I was going for brevity!  And failed.

I know that the multiplicity correction is applied on a per-hkl basis in the calculation of Rmeas. However, the average multiplicity over the whole calculation is most likely not an integer. Some hkls may be observed twice while others only once, or perhaps 3-4 times in the same scaling run.


Allow me to do the error propagation properly.  Consider the scenario:

Your outer resolution bin has a true I/sigma = 1.00 and average multiplicity of 2.0. Let's say there are 100 hkl indices in this bin. I choose the "true" intensities of each hkl from an exponential (aka Wilson) distribution. Further assume the background is high, so the error in each observation after background subtraction may be taken from a Gaussian distribution. Let's further choose the per-hkl multiplicity from a Poisson distribution with expectation value 2.0, so 0 is possible, but the long-term average multiplicity is 2.0. For the R calculation, when the multiplicity of any given hkl is less than 2 it is skipped. What I end up with after 120,000 trials is a distribution of values for each R factor. See attached graph.

What I hope is readily apparent is that the distribution of Rmerge values is taller and sharper than that of the Rmeas values. The most likely Rmeas is 80% and that of Rmerge is 64.6%. This is expected, of course. But what I hope to impress upon you is that the most likely value is not generally the one that you will get! The distribution has a width. Specifically, Rmeas could be as low as 40%, or as high as 209%, depending on the trial. Half of the trial results fall between 71.4% and 90.3%, a range of 19 percentage points. Rmerge has a middle-half range from 57.6% to 72.9% (15.3 percentage points). This range of possible values of Rmerge or Rmeas from data with the same intrinsic quality is what I mean when I say "numerical instability". Each and every trial had the same true I/sigma and multiplicity, and yet the R factors I get vary depending on the trial. Unfortunately for most of us with real data, you only ever get one trial, and you can't predict which Rmeas or Rmerge you'll get.

My point here is that R statistics in general are not comparable from experiment to experiment when you are looking at data with low average intensity and low multiplicity, and it appears that Rmeas is less stable than Rmerge. Not by much, mind you, but it still jumps around more.
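
The scenario described above can be sketched in Python roughly as follows. This is only an illustration, not the code behind the attached graph: the function names, the seed, the default of 2000 trials, and the reading of "true I/sigma = 1.00" as mean true intensity divided by the per-observation Gaussian sigma are all assumptions made for the example.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Draw one Poisson variate (Knuth's multiplication method; fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def one_trial(rng, n_hkl=100, mean_mult=2.0, i_over_sigma=1.0):
    """Return (Rmerge, Rmeas) for one simulated resolution bin."""
    # Assumption: mean true intensity is 1, so <I>/sigma equals i_over_sigma.
    sigma = 1.0 / i_over_sigma
    sum_dev = sum_dev_corrected = sum_obs = 0.0
    for _ in range(n_hkl):
        true_i = rng.expovariate(1.0)      # exponential (Wilson) intensity
        n = poisson(rng, mean_mult)        # per-hkl multiplicity
        if n < 2:
            continue                       # nothing to compare against
        obs = [true_i + rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(obs) / n
        dev = sum(abs(x - mean) for x in obs)
        sum_dev += dev
        sum_dev_corrected += math.sqrt(n / (n - 1)) * dev
        sum_obs += sum(obs)
    # No guard against sum_obs near zero; with these defaults it stays well above it.
    return sum_dev / sum_obs, sum_dev_corrected / sum_obs


def run_trials(n_trials=2000, seed=0):
    """Repeat the experiment with identical true I/sigma and mean multiplicity."""
    rng = random.Random(seed)
    return [one_trial(rng) for _ in range(n_trials)]


if __name__ == "__main__":
    results = run_trials()
    columns = (("Rmerge", [r[0] for r in results]),
               ("Rmeas", [r[1] for r in results]))
    for name, values in columns:
        q1, q2, q3 = statistics.quantiles(values, n=4)
        print(f"{name}: median {q2:.3f}, middle half {q1:.3f} to {q3:.3f}")
```

Run as a script, it prints the median and middle-half range of each statistic. The exact figures depend on the seed and the number of trials, so they should not be expected to match the numbers quoted above digit for digit.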


Hope that is clearer?

Note that in no way am I suggesting that low multiplicity is the right way to collect data. Far from it. Especially with modern detectors that have negligible read-out noise. But when micro crystals only give off a handful of photons each before they die, low multiplicity might be all you have.


-James Holton
MAD Scientist



On 7/7/2017 2:33 PM, Edward A. Berry wrote:

I think the confusion here is that the "multiplicity correction" is applied on each reflection, where it will be an integer 2 or greater (can't estimate variance with only one measurement). You can only correct in an approximate way using the average multiplicity of the dataset, since it would depend on the distribution of multiplicity over the reflections.

And the correction is for R-merge. You don't need to apply a correction
to R-meas.
R-meas is a redundancy-independent best estimate of the variance.
Whatever you would have used R-merge for (hopefully making allowance
for the multiplicity) you can use R-meas and not worry about multiplicity.
Again, what information does R-merge provide that R-meas does not provide
in a more accurate way?

According to the Denzo manual, one way to artificially reduce
R-merge is to include reflections with only one measurement (averaging
in a lot of zeros always helps bring an average down), and they say
there were actually some programs that did that. However I'm
quite sure none of the ones we rely on today do that.

On 07/07/2017 03:12 PM, Kay Diederichs wrote:
James,
I cannot follow you. "n approaches 1" can only mean n = 2 because n is an integer. And for n=2 the sqrt(n/(n-1)) factor is well-defined. For n=1, neither contributions to Rmeas nor Rmerge nor to any other precision indicator can be calculated anyway, because there's nothing this measurement can be compared against.
just my 2 cents,

Kay
On Fri, 7 Jul 2017 10:57:17 -0700, James Holton <jmhol...@slac.stanford.edu> wrote:
I happen to be one of those people who think Rmerge is a very useful
statistic. Not as a method of evaluating the resolution limit, which is
mathematically ridiculous, but for a host of other important things,
like evaluating the performance of data collection equipment, and
evaluating the isomorphism of different crystals, to name a few.
I like Rmerge because it is a simple statistic that has a simple formula
and has not undergone any "corrections".  Corrections increase
complexity, and complexity opens the door to manipulation by the
desperate and/or misguided.  For example, overzealous outlier rejection
is a common way to abuse R factors, and it is far too often swept under
the rug, sometimes without the user even knowing about it. This is
especially problematic when working in a regime where the statistic of
interest is unstable, and for R factors this is low intensity data.
Rejecting just the right "outliers" can make any R factor look a lot
better.  Why would Rmeas be any more unstable than Rmerge? Look at the
formula. There is an "n-1" in the denominator, where n is the
multiplicity. So, what happens when n approaches 1? What happens when
n=1? This is not to say Rmerge is better than Rmeas. In fact, I believe
the latter is generally superior to the first, unless you are working
near n = 1. The sqrt(n/(n-1)) is trying to correct for bias in the R
statistic, but fighting one infinity with another infinity is a
dangerous game.
My point is that neither Rmerge nor Rmeas are easily interpreted without
knowing the multiplicity.  If you see Rmeas = 10% and the multiplicity
is 10, then you know what that means.  Same for Rmerge, since at n=10
both stats have nearly the same value.  But if you have Rmeas = 45% and
multiplicity = 1.05, what does that mean? Rmeas will be only 33% if the
multiplicity is rounded up to 1.1. This is what I mean by "numerical
instability", the value of the R statistic itself becomes sensitive to
small amounts of noise, and behaves more and more like a random number
generator. And if you have Rmeas = 33% and no indication of
multiplicity, it is hard to know what is going on.  I personally am a
lot more comfortable seeing qualitative agreement between Rmerge and
Rmeas, because that means the numerical instability of the multiplicity
correction didn't mess anything up.
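
The 45% to 33% arithmetic above can be checked with a few lines of Python. This is a sketch that, like the example in the message, applies the sqrt(n/(n-1)) factor to the average multiplicity of the whole bin; as noted elsewhere in this thread, the factor is really applied per reflection with integer n, so treat this as an approximation. The names multiplicity_factor, implied_rmerge and rmeas_at_1_10 are made up for the illustration.

```python
import math


def multiplicity_factor(n):
    """sqrt(n / (n - 1)), the factor relating Rmeas to Rmerge at multiplicity n."""
    if n <= 1:
        raise ValueError("the factor is undefined for n <= 1")
    return math.sqrt(n / (n - 1))


# Start from Rmeas = 45% at an average multiplicity of 1.05 ...
rmeas_at_1_05 = 0.45
implied_rmerge = rmeas_at_1_05 / multiplicity_factor(1.05)

# ... and recompute Rmeas as if the multiplicity had been rounded up to 1.1.
rmeas_at_1_10 = implied_rmerge * multiplicity_factor(1.10)

print(f"factor at n=1.05: {multiplicity_factor(1.05):.2f}")
print(f"factor at n=1.10: {multiplicity_factor(1.10):.2f}")
print(f"Rmeas at n=1.10:  {rmeas_at_1_10:.2f}")  # 0.33
```

The factor itself drops from about 4.6 to about 3.3 between the two multiplicities, which is where the sensitivity comes from.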

Of course, when the intensity is weak R statistics in general are not
useful.  Both Rmeas and Rmerge have the sum of all intensities in the
denominator, so when the bin-wide sum approaches zero you have another
infinity to contend with.  This one starts to rear its ugly head once
I/sigma drops below about 3, and this is why our ancestors always
applied a sigma cutoff before computing an R factor. Our small-molecule
colleagues still do this!  They call it "R1".  And it is an excellent
indicator of the overall relative error.  The relative error in the
outermost bin is not meaningful, and strangely enough nobody ever
reported the outer-resolution Rmerge before 1995.

For weak signals, Correlation Coefficients are better, but for strong
signals CC pegs out at >95%, making it harder to see relative errors.
I/sigma is what we'd like to know, but the value of "sigma" is still
prone to manipulation by not just outlier rejection, but massaging the
so-called "error model".  Suffice it to say, crystallographic data
contain more than one type of error.  Some sources are important for
weak spots, others are important for strong spots, and still others are
only apparent in the mid-range.  Some sources of error are only
important at low multiplicity, and others only manifest at high
multiplicity. There is no single number that can be used to evaluate all
aspects of data quality.
So, I remain a champion of reporting Rmerge. Not in the high-angle bin,
because that is essentially a random number, but overall Rmerge and
low-angle-bin Rmerge next to multiplicity, Rmeas, CC1/2 and other
statistics is the only way you can glean enough information about where
the errors are coming from in the data.  Rmeas is a useful addition
because it helps us correct for multiplicity without having to do math
in our head.  Users generally thank you for that. Rmerge, however, has
served us well for more than half a century, and I believe Uli Arndt
knew what he was doing.  I hope we all know enough about history to
realize that future generations seldom thank their ancestors for
"protecting" them from information.

-James Holton
MAD Scientist


On 7/5/2017 10:36 AM, Graeme Winter wrote:
Frank,
you are asking me to remove features that I like, so I would feel that the challenge is for you to prove that this is harmful; however:
   - at the minimum, I find it a useful check sum that the stats are internally consistent (though I interpret it for lots of other reasons too)
   - it is faulty I agree, but (with caveats) still useful IMHO
Sorry for being terse, but I remain to be convinced that removing it increases the amount of information
CC’ing BB as requested

Best wishes Graeme
On 5 Jul 2017, at 17:17, Frank von Delft <frank.vonde...@sgc.ox.ac.uk> wrote:
You keep not answering the challenge.
It's really simple: what information does Rmerge provide that Rmeas doesn't?
(If you answer, email to the BB.)


On 05/07/2017 16:04, graeme.win...@diamond.ac.uk wrote:
Dear Frank,
You are forcefully arguing essentially that others are wrong if we feel an existing statistic continues to be useful, and instead insist that it be outlawed so that we may not make use of it, just in case someone misinterprets it.
Very well
I do however express disquiet that we as software developers feel browbeaten to remove the output we find useful because “the community” feel that it is obsolete.
I feel that Jacob’s short story on this thread illustrates that educating the next generation of crystallographers to understand what all of the numbers mean is critical, and that a numerological approach of trying to optimise any one statistic is essentially doomed. Precisely the same argument could be made for people cutting the “resolution” at the wrong place in order to improve the average I/sig(I) of the data set.
Denying access to information is not a solution to misinterpretation, from where I am sat, however I acknowledge that other points of view exist.
Best wishes Graeme
On 5 Jul 2017, at 12:11, Frank von Delft <frank.vonde...@sgc.ox.ac.uk> wrote:
Graeme, Andrew
Jacob is not arguing against an R-based statistic; he's pointing out that leaving out the multiplicity-weighting is prehistoric (Diederichs & Karplus published it 20 years ago!).
So indeed: Rmerge, Rpim and I/sigI give different information. As you say.
But no: Rmerge and Rmeas and Rcryst do NOT give different information. Except:
    * Rmerge is a (potentially) misleading version of Rmeas.
    * Rcryst and Rmerge and Rsym are terms that no longer have significance in the single cryo-dataset world.
phx.



On 05/07/2017 09:43, Andrew Leslie wrote:
I would like to support Graeme in his wish to retain Rmerge in Table 1, essentially for exactly the same reasons.
I also strongly support Francis Reyes' comment about the usefulness of Rmerge at low resolution, and I would add to his list that it can also, in some circumstances, be more indicative of the wrong choice of symmetry (too high) than the statistics that come from POINTLESS (excellent though that program is!).
Andrew
On 5 Jul 2017, at 05:44, Graeme Winter <graeme.win...@gmail.com> wrote:
Hi Jacob
Yes, I got this - and I appreciate the benefit of Rmeas for dealing with measuring agreement for small-multiplicity observations. Having this *as well* is very useful and I agree Rmeas / Rpim / CC-half should be the primary “quality” statistics.
However, you asked if there is any reason to *keep* rather than *eliminate* Rmerge, and I offered one :o)
I do not see what harm there is reporting Rmerge, even if it is just used in the inner shell or just used to capture a flavour of the data set overall. I also appreciate that Rmeas converges to the same value for large multiplicity i.e.:
                                           Overall  InnerShell  OuterShell
Low resolution limit                         39.02       39.02        1.39
High resolution limit                         1.35        6.04        1.35

Rmerge (within I+/I-)                        0.080       0.057       2.871
Rmerge (all I+ and I-)                       0.081       0.059       2.922
Rmeas (within I+/I-)                         0.081       0.058       2.940
Rmeas (all I+ & I-)                          0.082       0.059       2.958
Rpim (within I+/I-)                          0.013       0.009       0.628
Rpim (all I+ & I-)                           0.009       0.007       0.453
Rmerge in top intensity bin                  0.050           -           -
Total number of observations               1265512       16212       53490
Total number unique                          17515         224        1280
Mean((I)/sd(I))                               29.7       104.3         1.5
Mn(I) half-set correlation CC(1/2)           1.000       1.000       0.778
Completeness                                 100.0        99.7       100.0
Multiplicity                                  72.3        72.4        41.8

Anomalous completeness                       100.0       100.0       100.0
Anomalous multiplicity                        37.2        42.7        21.0
DelAnom correlation between half-sets        0.497       0.766      -0.026
Mid-Slope of Anom Normal Probability         1.039           -           -
(this is a good case for Rpim & CC-half as resolution limit criteria)
If the statistics you want to use are there & some others also, what is the pressure to remove them? Surely we want to educate on how best to interpret the entire table above to get a fuller picture of the overall quality of the data? My 0th-order request would be to publish the three shells as above ;o)
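
One way to read the table as a whole is to check that the three R statistics agree with each other. The sketch below uses the "all I+ and I-" rows and assumes that the mean multiplicity of each shell can stand in for the per-reflection n, which is only approximate because merging programs apply the factors reflection by reflection. The dictionary layout and the function name are made up for the illustration.

```python
import math

# (Rmerge, Rmeas, Rpim, multiplicity) from the "all I+ and I-" rows of the table
shells = {
    "Overall":    (0.081, 0.082, 0.009, 72.3),
    "InnerShell": (0.059, 0.059, 0.007, 72.4),
    "OuterShell": (2.922, 2.958, 0.453, 41.8),
}


def predicted_from_rmerge(rmerge, n):
    """Approximate (Rmeas, Rpim) from Rmerge, treating the mean multiplicity n as uniform."""
    rmeas = rmerge * math.sqrt(n / (n - 1))
    rpim = rmerge * math.sqrt(1 / (n - 1))
    return rmeas, rpim


for name, (rmerge, rmeas, rpim, n) in shells.items():
    est_rmeas, est_rpim = predicted_from_rmerge(rmerge, n)
    print(f"{name:10s}  Rmeas {rmeas:.3f} vs {est_rmeas:.3f}"
          f"   Rpim {rpim:.3f} vs {est_rpim:.3f}")
```

Since the tabulated values are rounded to three decimals, only rough agreement should be expected, most visibly for the small Rpim values.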
Cheers Graeme
On 4 Jul 2017, at 22:09, Keller, Jacob <kell...@janelia.hhmi.org> wrote:
I suggested replacing Rmerge/sym/cryst with Rmeas, not Rpim. Rmeas is simply (Rmerge * sqrt(n/(n-1))) where n is the number of measurements of that reflection. It's merely a way of correcting for the multiplicity-related artifact of Rmerge, which is becoming even more of a problem with data sets of increasing variability in multiplicity. Consider the case of comparing a data set with a multiplicity of 2 versus one of 100: equivalent data quality would yield Rmerges diverging by a factor of ~1.4. But this has all been covered before in several papers. It can be and is reported in resolution bins, so can be used exactly as you say. So, why not "disappear" Rmerge from the software?
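
The factor of ~1.4 follows from the sqrt(n/(n-1)) relation. A short Python check, assuming every reflection in a data set has the same multiplicity n (a simplification; the function name is hypothetical):

```python
import math


def rmerge_to_rmeas_ratio(n):
    """Rmerge / Rmeas when every reflection is measured n times (n >= 2)."""
    if n < 2:
        raise ValueError("need at least two measurements per reflection")
    return math.sqrt((n - 1) / n)


for n in (2, 3, 5, 10, 100):
    print(f"n = {n:3d}   Rmerge/Rmeas = {rmerge_to_rmeas_ratio(n):.3f}")

# Two data sets of equal quality (equal Rmeas), one with n = 2 and one with n = 100
divergence = rmerge_to_rmeas_ratio(100) / rmerge_to_rmeas_ratio(2)
print(f"Rmerge(n=100) / Rmerge(n=2) at equal Rmeas: {divergence:.2f}")  # 1.41
```
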
The only reason I could come up with for keeping it is historical reasons or comparisons to previous datasets, but anyway those comparisons would be confounded by variabilities in multiplicity and a hundred other things, so come on, developers, just comment it out!
JPK




-----Original Message-----
From: graeme.win...@diamond.ac.uk [mailto:graeme.win...@diamond.ac.uk]
Sent: Tuesday, July 04, 2017 4:37 PM
To: Keller, Jacob <kell...@janelia.hhmi.org>
Cc: ccp4bb@jiscmail.ac.uk
Subject: Re: [ccp4bb] Rmergicide Through Programming

Hi Jacob
Unbiased estimate of the true unmerged I/sig(I) of your data (I find this particularly useful at low resolution) i.e. if your inner shell Rmerge is 10% your data agree very poorly; if 2% it says your data agree very well provided you have sensible multiplicity… obviously depends on sensible interpretation. Rpim hides this (though tells you more about the quality of average measurement)
Essentially, for I/sig(I) you can (by and large) adjust your sig(I) values however you like if you were so inclined. You can only adjust Rmerge by excluding measurements.
I would therefore defend that - amongst the other stats you enumerate below - it still has a place
Cheers Graeme
On 4 Jul 2017, at 14:10, Keller, Jacob <kell...@janelia.hhmi.org> wrote:
Rmerge does contain information which complements the others.
What information? I was trying to think of a counterargument to what I proposed, but could not think of a reason in the world to keep reporting it.
JPK
On 4 Jul 2017, at 12:00, Keller, Jacob <kell...@janelia.hhmi.org> wrote:
Dear Crystallographers,
Having been repeatedly chagrined about the continued use and reporting of Rmerge rather than Rmeas or similar, I thought of a potential way to promote the change: what if merging programs would completely omit Rmerge/cryst/sym? Is there some reason to continue to report these stats, or are they just grandfathered into the software? I doubt that any journal or crystallographer would insist on reporting Rmerge per se. So, I wonder what developers would think about commenting out a few lines of their code, seeing what happens? Maybe a comment to the effect of "Rmerge is now deprecated; use Rmeas" would be useful as well. Would something catastrophic happen?
All the best,

Jacob Keller

*******************************************
Jacob Pearson Keller, PhD
Research Scientist
HHMI Janelia Research Campus / Looger lab
Phone: (571)209-4000 x3159
Email: kell...@janelia.hhmi.org
*******************************************


