Not so fast:

First of all, I cannot remember ever having come across a paper reporting a 
multiplicity of around 1, and if there are such cases, they are so rare that 
they are not worth accounting for; they should raise eyebrows in the first 
place and cast doubt on all of the statistics, let alone the structure. What 
can one do with a data set like this in terms of stats? Pretty much nothing. 
This could be answered by a multiplicity search in the PDB, I guess. Okay, just 
did this: there are 3,076 structures with overall multiplicities of 0-2, and 
92,807 with > 2. Okay, so about 3.2%, which is more than I would have thought. 
If you cut it down to 1.5, though, there are only 400, so 0.4%.
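For what it's worth, the arithmetic on those counts (a trivial sanity check; the 3,076 / 92,807 / 400 figures are just the search results quoted above):

```python
# Counts from the PDB multiplicity search described above.
low_mult = 3_076      # structures with overall multiplicity 0-2
high_mult = 92_807    # structures with overall multiplicity > 2
very_low_mult = 400   # structures with overall multiplicity <= 1.5

total = low_mult + high_mult
print(f"multiplicity 0-2:    {100 * low_mult / total:.1f}% of {total}")       # ~3.2%
print(f"multiplicity <= 1.5: {100 * very_low_mult / total:.1f}% of {total}")  # ~0.4%
```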

Second, according to the CCP4 wiki, it seems that neither the Rmerge nor the 
Rmeas sum includes reflections with only one measurement (the wording is a bit 
ambiguous, though). How, for example, can one calculate the |I - <I>| term with 
only one measurement? Just call it 0? We’d have to confirm this with the 
developers, but it seems that neither R includes n = 1 reflections, and 
therefore the infinite-denominator problem is non-existent.
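In code form, the skip-singletons reading of the wiki would look something like this (a hypothetical sketch, not the actual behaviour of any particular merging program; `r_factors` and its per-reflection input format are my own invention):

```python
from math import sqrt

def r_factors(reflections):
    """Rmerge and Rmeas over a list of reflections, each given as a list
    of redundant intensity measurements.  Reflections with only one
    measurement (n = 1) are skipped entirely, as the CCP4 wiki seems to
    suggest: |I - <I>| is trivially zero for them, and the sqrt(n/(n-1))
    correction would blow up."""
    merge_num = meas_num = denom = 0.0
    for obs in reflections:
        n = len(obs)
        if n < 2:
            continue  # singleton: contributes to neither numerator nor denominator
        mean_i = sum(obs) / n
        dev = sum(abs(i - mean_i) for i in obs)
        merge_num += dev
        meas_num += sqrt(n / (n - 1)) * dev
        denom += sum(obs)
    return merge_num / denom, meas_num / denom

# Two measurements of one reflection, plus a singleton that is ignored:
rmerge, rmeas = r_factors([[100.0, 110.0], [50.0]])
print(rmerge, rmeas)
```

With the singleton ignored, Rmeas comes out exactly sqrt(2) times Rmerge, as expected when every contributing reflection has n = 2.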

Also, the Rmeas calculation corrects for the multiplicity of each reflection 
individually, and n there is an integer, so it cannot really “approach” 1.

Third, I cannot see how Rmerge is any more impervious to manipulation 
(consider Uriah’s script in my recent foray into fiction). Also, if I am wrong 
and Rmerge does include n = 1 reflections, what better way to drive R values to 
0 than setting the multiplicity to 1 for all reflections? Perfect data! (At 
least as measured by Rmerge.) Further, since Rmeas corrects for multiplicity on 
a per-reflection basis, the multiplicity can be either 1, in which case the 
reflection is not included, or greater, in which case the infinity problem 
disappears.

Fourth, at first I liked the rationale of evaluating data-collection equipment, 
but this is a pretty special case, generally not relevant to publishing 
structures, and one would be stymied far more by the variability of the 
sample(s).

I still am not convinced, and suspect perhaps the Rmeas folks can/will answer 
better than I.

All the best,

Jacob


From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Frank von 
Delft
Sent: Friday, July 07, 2017 2:26 PM
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] Rmergicide Through Programming


Okay, that is a strong answer:  Rmeas has too many infinities for comfort.  
Thanks, very instructive yet again!

phx


On 07/07/2017 18:57, James Holton wrote:
I happen to be one of those people who think Rmerge is a very useful statistic. 
 Not as a method of evaluating the resolution limit, which is mathematically 
ridiculous, but for a host of other important things, like evaluating the 
performance of data collection equipment, and evaluating the isomorphism of 
different crystals, to name a few.

I like Rmerge because it is a simple statistic that has a simple formula and 
has not undergone any "corrections".  Corrections increase complexity, and 
complexity opens the door to manipulation by the desperate and/or misguided.  
For example, overzealous outlier rejection is a common way to abuse R factors, 
and it is far too often swept under the rug, sometimes without the user even 
knowing about it.  This is especially problematic when working in a regime 
where the statistic of interest is unstable, and for R factors this is low 
intensity data.  Rejecting just the right "outliers" can make any R factor look 
a lot better.  Why would Rmeas be any more unstable than Rmerge?  Look at the 
formula. There is an "n-1" in the denominator, where n is the multiplicity.  
So, what happens when n approaches 1 ?  What happens when n=1? This is not to 
say Rmerge is better than Rmeas. In fact, I believe the latter is generally 
superior to the first, unless you are working near n = 1. The sqrt(n/(n-1)) is 
trying to correct for bias in the R statistic, but fighting one infinity with 
another infinity is a dangerous game.

My point is that neither Rmerge nor Rmeas is easily interpreted without 
knowing the multiplicity.  If you see Rmeas = 10% and the multiplicity is 10, 
then you know what that means.  Same for Rmerge, since at n=10 both stats have 
nearly the same value.  But if you have Rmeas = 45% and multiplicity = 1.05, 
what does that mean?  Rmeas will be only 33% if the multiplicity is rounded up 
to 1.1. This is what I mean by "numerical instability", the value of the R 
statistic itself becomes sensitive to small amounts of noise, and behaves more 
and more like a random number generator. And if you have Rmeas = 33% and no 
indication of multiplicity, it is hard to know what is going on.  I personally 
am a lot more comfortable seeing qualitative agreement between Rmerge and 
Rmeas, because that means the numerical instability of the multiplicity 
correction didn't mess anything up.
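The 45% vs. 33% swing is easy to reproduce (a rough sketch: a single overall sqrt(n/(n-1)) factor applied to a fixed underlying Rmerge of 10%, which simplifies the true per-reflection correction):

```python
from math import sqrt

rmerge = 0.10  # hold the underlying agreement fixed at 10%
for n in (1.05, 1.10, 2.0, 10.0):
    factor = sqrt(n / (n - 1))
    print(f"n = {n:5.2f}: correction = {factor:.2f}, Rmeas ~ {rmerge * factor:.1%}")
```

A shift in average multiplicity from 1.05 to 1.10 swings Rmeas from roughly 46% to 33% even though the underlying Rmerge never moved, while at n = 10 the correction is only about 1.05 and the two statistics nearly coincide.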

Of course, when the intensity is weak R statistics in general are not useful.  
Both Rmeas and Rmerge have the sum of all intensities in the denominator, so 
when the bin-wide sum approaches zero you have another infinity to contend 
with.  This one starts to rear its ugly head once I/sigma drops below about 3, 
and this is why our ancestors always applied a sigma cutoff before computing an 
R factor.  Our small-molecule colleagues still do this!  They call it "R1".  
And it is an excellent indicator of the overall relative error.  The relative 
error in the outermost bin is not meaningful, and strangely enough nobody ever 
reported the outer-resolution Rmerge before 1995.

For weak signals, Correlation Coefficients are better, but for strong signals 
CC pegs out at >95%, making it harder to see relative errors.  I/sigma is what 
we'd like to know, but the value of "sigma" is still prone to manipulation by 
not just outlier rejection, but massaging the so-called "error model".  Suffice 
it to say, crystallographic data contain more than one type of error.  Some 
sources are important for weak spots, others are important for strong spots, 
and still others are only apparent in the mid-range.  Some sources of error are 
only important at low multiplicity, and others only manifest at high 
multiplicity. There is no single number that can be used to evaluate all 
aspects of data quality.

So, I remain a champion of reporting Rmerge.  Not in the high-angle bin, 
because that is essentially a random number, but reporting the overall and 
low-angle-bin Rmerge next to multiplicity, Rmeas, CC1/2 and other statistics is 
the only way to glean enough information about where the errors in the data are 
coming from.  Rmeas is a useful addition because it helps us correct for 
multiplicity without having to do math in our head.  Users generally thank you 
for that. Rmerge, however, has served us well for more than half a century, and 
I believe Uli Arndt knew what he was doing.  I hope we all know enough about 
history to realize that future generations seldom thank their ancestors for 
"protecting" them from information.

-James Holton
MAD Scientist


On 7/5/2017 10:36 AM, Graeme Winter wrote:

Frank,

you are asking me to remove features that I like, so I would feel that the 
challenge is for you to prove that they are harmful. However:

  - at the minimum, I find it a useful checksum that the stats are internally 
consistent (though I interpret it for lots of other reasons too)
  - it is flawed, I agree, but (with caveats) still useful IMHO

Sorry for being terse, but I remain to be convinced that removing it increases 
the amount of information

CC’ing BB as requested

Best wishes Graeme



On 5 Jul 2017, at 17:17, Frank von Delft <frank.vonde...@sgc.ox.ac.uk> wrote:

You keep not answering the challenge.

It's really simple:  what information does Rmerge provide that Rmeas doesn't.

(If you answer, email to the BB.)


On 05/07/2017 16:04, graeme.win...@diamond.ac.uk wrote:

Dear Frank,

You are forcefully arguing, essentially, that we are wrong if we feel an 
existing statistic continues to be useful, and insisting that it be outlawed so 
that we may not make use of it, just in case someone misinterprets it.

Very well

I do however express disquiet that we as software developers feel browbeaten to 
remove the output we find useful because “the community” feel that it is 
obsolete.

I feel that Jacob’s short story on this thread illustrates that educating the 
next generation of crystallographers to understand what all of the numbers mean 
is critical, and that a numerological approach of trying to optimise any one 
statistic is essentially doomed. Precisely the same argument could be made for 
people cutting the “resolution” at the wrong place in order to improve the 
average I/sig(I) of the data set.

Denying access to information is not a solution to misinterpretation, from 
where I am sat; however, I acknowledge that other points of view exist.

Best wishes Graeme


On 5 Jul 2017, at 12:11, Frank von Delft <frank.vonde...@sgc.ox.ac.uk> wrote:


Graeme, Andrew

Jacob is not arguing against an R-based statistic;  he's pointing out that 
leaving out the multiplicity-weighting is prehistoric (Diederichs & Karplus 
published it 20 years ago!).

So indeed:   Rmerge, Rpim and I/sigI give different information.  As you say.

But no:   Rmerge and Rmeas and Rcryst do NOT give different information.  
Except:

   * Rmerge is a (potentially) misleading version of Rmeas.

   * Rcryst and Rmerge and Rsym are terms that no longer have significance in 
the single cryo-dataset world.

phx.



On 05/07/2017 09:43, Andrew Leslie wrote:

I would like to support Graeme in his wish to retain Rmerge in Table 1, 
essentially for exactly the same reasons.

I also strongly support Francis Reyes comment about the usefulness of Rmerge at 
low resolution, and I would add to his list that it can also, in some 
circumstances, be more indicative of the wrong choice of symmetry (too high) 
than the statistics that come from POINTLESS (excellent though that program 
is!).

Andrew
On 5 Jul 2017, at 05:44, Graeme Winter <graeme.win...@gmail.com> wrote:

HI Jacob

Yes, I got this - and I appreciate the benefit of Rmeas for dealing with 
measuring agreement for small-multiplicity observations. Having this *as well* 
is very useful and I agree Rmeas / Rpim / CC-half should be the primary 
“quality” statistics.

However, you asked if there is any reason to *keep* rather than *eliminate* 
Rmerge, and I offered one :o)

I do not see what harm there is in reporting Rmerge, even if it is just used in 
the inner shell or just to capture a flavour of the data set overall. I also 
appreciate that Rmerge converges to the same value as Rmeas at large 
multiplicity, i.e.:

                                            Overall  InnerShell  OuterShell
Low resolution limit                       39.02     39.02      1.39
High resolution limit                       1.35      6.04      1.35

Rmerge  (within I+/I-)                     0.080     0.057     2.871
Rmerge  (all I+ and I-)                    0.081     0.059     2.922
Rmeas (within I+/I-)                       0.081     0.058     2.940
Rmeas (all I+ & I-)                        0.082     0.059     2.958
Rpim (within I+/I-)                        0.013     0.009     0.628
Rpim (all I+ & I-)                         0.009     0.007     0.453
Rmerge in top intensity bin                0.050        -         -
Total number of observations             1265512     16212     53490
Total number unique                        17515       224      1280
Mean((I)/sd(I))                             29.7     104.3       1.5
Mn(I) half-set correlation CC(1/2)         1.000     1.000     0.778
Completeness                               100.0      99.7     100.0
Multiplicity                                72.3      72.4      41.8

Anomalous completeness                     100.0     100.0     100.0
Anomalous multiplicity                      37.2      42.7      21.0
DelAnom correlation between half-sets      0.497     0.766    -0.026
Mid-Slope of Anom Normal Probability       1.039       -         -

(this is a good case for Rpim & CC-half as resolution limit criteria)

If the statistics you want to use are there & some others also, what is the 
pressure to remove them? Surely we want to educate on how best to interpret the 
entire table above to get a fuller picture of the overall quality of the data? 
My 0th-order request would be to publish the three shells as above ;o)

Cheers Graeme



On 4 Jul 2017, at 22:09, Keller, Jacob <kell...@janelia.hhmi.org> wrote:

I suggested replacing Rmerge/sym/cryst with Rmeas, not Rpim. Rmeas is simply 
Rmerge * sqrt(n/(n-1)), where n is the number of measurements of that 
reflection. It's merely a way of correcting for the multiplicity-related 
artifact of Rmerge, which is becoming even more of a problem with data sets of 
increasing variability in multiplicity. Consider the case of comparing a data 
set with a multiplicity of 2 versus one of 100: equivalent data quality would 
yield Rmerges diverging by a factor of ~1.4. But this has all been covered 
before in several papers. It can be and is reported in resolution bins, so it 
can be used exactly as you say. So, why not "disappear" Rmerge from the 
software?
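The ~1.4 factor is just the ratio of the two sqrt(n/(n-1)) corrections (a quick check, treating n as the overall multiplicity):

```python
from math import sqrt

f2 = sqrt(2 / (2 - 1))        # correction at multiplicity 2: ~1.414
f100 = sqrt(100 / (100 - 1))  # correction at multiplicity 100: ~1.005
print(f2 / f100)              # ~1.41: equally good data, Rmerges ~40% apart
```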

The only reason I could come up with for keeping it is historical, for 
comparisons to previous datasets, but anyway those comparisons would be 
confounded by variabilities in multiplicity and a hundred other things, so come 
on, developers, just comment it out!

JPK




-----Original Message-----
From: graeme.win...@diamond.ac.uk [mailto:graeme.win...@diamond.ac.uk]
Sent: Tuesday, July 04, 2017 4:37 PM
To: Keller, Jacob <kell...@janelia.hhmi.org>
Cc: ccp4bb@jiscmail.ac.uk
Subject: Re: [ccp4bb] Rmergicide Through Programming

HI Jacob

An unbiased estimate of the true unmerged I/sig(I) of your data (I find this 
particularly useful at low resolution): if your inner-shell Rmerge is 10%, your 
data agree very poorly; if it is 2%, your data agree very well, provided you 
have sensible multiplicity… obviously this depends on sensible interpretation. 
Rpim hides this (though it tells you more about the quality of the average 
merged measurement).

Essentially, for I/sig(I) you can (by and large) adjust your sig(I) values 
however you like if you were so inclined. You can only adjust Rmerge by 
excluding measurements.

I would therefore defend that - amongst the other stats you enumerate below - 
it still has a place

Cheers Graeme

On 4 Jul 2017, at 14:10, Keller, Jacob <kell...@janelia.hhmi.org> wrote:

Rmerge does contain information which complements the others.

What information? I was trying to think of a counterargument to what I 
proposed, but could not think of a reason in the world to keep reporting it.

JPK


On 4 Jul 2017, at 12:00, Keller, Jacob <kell...@janelia.hhmi.org> wrote:

Dear Crystallographers,

Having been repeatedly chagrined by the continued use and reporting of 
Rmerge rather than Rmeas or similar, I thought of a potential way to promote 
the change: what if merging programs would completely omit Rmerge/cryst/sym? Is 
there some reason to continue to report these stats, or are they just 
grandfathered into the software? I doubt that any journal or crystallographer 
would insist on reporting Rmerge per se. So, I wonder what developers would 
think about commenting out a few lines of their code, seeing what happens? 
Maybe a comment to the effect of "Rmerge is now deprecated; use Rmeas" would be 
useful as well. Would something catastrophic happen?

All the best,

Jacob Keller

*******************************************
Jacob Pearson Keller, PhD
Research Scientist
HHMI Janelia Research Campus / Looger lab
Phone: (571)209-4000 x3159
Email: kell...@janelia.hhmi.org
*******************************************


--
This e-mail and any attachments may contain confidential, copyright and or 
privileged material, and are for the use of the intended addressee only. If you 
are not the intended addressee or an authorised recipient of the addressee 
please notify us of receipt by returning the e-mail and do not use, copy, 
retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not 
necessarily of Diamond Light Source Ltd.
Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments 
are free from viruses and we cannot accept liability for any damage which you 
may sustain as a result of software viruses which may be transmitted in or with 
the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and 
Wales with its registered office at Diamond House, Harwell Science and 
Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom






