Re: [ccp4bb] AW: [ccp4bb] Rmergicide Through Programming

John Berrisford Mon, 10 Jul 2017 07:20:06 -0700

Dear Herman

The new PDB deposition system (OneDep) allows you to enter values for Rmerge, Rsym, Rpim, Rrim and / or CC half. If, during deposition, you do not provide a value for any of these metrics then we will ask you for a value for one of them.

Also, PDB format is a legacy format for the PDB. In 2014 mmCIF became the archive format for the PDB and some large entries are no longer distributed in PDB format. mmCIF is not limited by the constraints of punch cards.


Please see https://www.wwpdb.org/documentation/file-formats-and-the-pdb

Regards

John

PDBe



On 10/07/2017 09:26, [email protected] wrote:

Dear All,

For me this whole discussion is an example of a large number of people barking 
up the wrong tree. The real issue is not whether data processing programs print 
an Rmerge amongst many other quality indicators, but the fact that the PDB 
and many journals still insist on using Rmerge as the primary quality 
indicator. As long as this is true, novice scientists might be led to believe 
that Rmerge is the most important quality indicator. As soon as the PDB and the 
journals request some other indicator, this will be over. So that is where we 
should direct our efforts.

I don't understand at all why the PDB still insists on an obsolete quality 
indicator. However, the PDB format for the coordinates also dates back to the 
1960's to be used with punch cards.

My 2 cents.
Herman



-----Original Message-----
From: CCP4 bulletin board [mailto:[email protected]] On Behalf Of Edward 
A. Berry
Sent: Saturday, 8 July 2017 22:31
To: [email protected]
Subject: Re: [ccp4bb] Rmergicide Through Programming

But R-merge is not really narrower as a fraction of the mean value; it just 
gets smaller proportionately as all the numbers get smaller:
RMSD of .0043 for R-meas multiplied by factor of 0.022/.027 gives 0.0035 which 
is the RMSD for Rmerge. The same was true in the previous example. You could 
multiply R-meas by .5 or .2 and get a sharper distribution yet! And that factor 
would be constant, where this only applies for super-low redundancy.
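The rescaling argument above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that 0.022 and 0.027 are the typical Rmerge and Rmeas values read off James's graph, and the variable names are illustrative.

```python
# Numbers quoted in the message above; the variable names are illustrative only.
rmsd_rmeas = 0.0043      # spread (RMSD) of the simulated Rmeas values
typical_rmerge = 0.022   # assumed typical Rmerge in the same simulation
typical_rmeas = 0.027    # assumed typical Rmeas in the same simulation

# Rescale the Rmeas spread by the ratio of the two typical values.
scale = typical_rmerge / typical_rmeas
rescaled_rmsd = rmsd_rmeas * scale

# Relative widths: spread divided by the typical value of each statistic.
rel_width_rmeas = rmsd_rmeas / typical_rmeas
rel_width_rmerge = rescaled_rmsd / typical_rmerge

print(f"scale factor          : {scale:.3f}")
print(f"rescaled RMSD         : {rescaled_rmsd:.4f}")
print(f"relative width, Rmeas : {rel_width_rmeas:.3f}")
print(f"relative width, Rmerge: {rel_width_rmerge:.3f}")
```

If the Rmerge spread really is just the rescaled Rmeas spread, the two relative widths are identical, which is the point being made: the narrower absolute distribution does not by itself show that Rmerge is the more stable statistic.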

On 07/08/2017 03:23 PM, James Holton wrote:

The expected distribution of Rmeas values is still wider than that of Rmerge 
for data with I/sigma=30 and average multiplicity=2.0. Graph attached.

I expect that anytime you incorporate more than one source of information you 
run the risk of a noisier statistic because every source of information can 
contain noise.  That is, Rmeas combines information about multiplicity with the 
absolute deviates in the data to form a statistic that is more accurate than 
Rmerge, but also (potentially) less precise.

Perhaps that is what we are debating here?  Which is better? accuracy or 
precision?  Personally, I prefer to know both.

-James Holton
MAD Scientist

On 7/8/2017 11:02 AM, Frank von Delft wrote:

It is quite easy to end up with low multiplicities in the low resolution shell, 
especially for low symmetry and fast-decaying crystals.

It is this scenario where Rmerge (lowres) is more misleading than Rmeas.

phx


On 08/07/2017 17:31, James Holton wrote:

What does Rmeas tell us that Rmerge doesn't?  Given that we know the 
multiplicity?

-James Holton
MAD Scientist

On 7/8/2017 9:15 AM, Frank von Delft wrote:

Anyway, back to reality:  does anybody still use R statistics to evaluate anything other than 
/strong/ data?  Certainly I never look at it except for the low-resolution bin (or strongest 
reflections). Specifically, a "2%-dataset" in that bin is probably healthy, while a 
"9%-dataset" probably Has Issues.

In which case, back to Jacob's question:  what does Rmerge tell us that Rmeas 
doesn't?

phx




On 08/07/2017 17:02, James Holton wrote:

Sorry for the confusion.  I was going for brevity!  And failed.

I know that the multiplicity correction is applied on a per-hkl basis in the 
calculation of Rmeas.  However, the average multiplicity over the whole 
calculation is most likely not an integer. Some hkls may be observed twice 
while others only once, or perhaps 3-4 times in the same scaling run.

Allow me to do the error propagation properly.  Consider the scenario:

Your outer resolution bin has a true I/sigma = 1.00 and average multiplicity of 2.0. 
Let's say there are 100 hkl indices in this bin.  I choose the "true" 
intensities of each hkl from an exponential (aka Wilson) distribution. Further assume the 
background is high, so the error in each observation after background subtraction may be 
taken from a Gaussian distribution. Let's further choose the per-hkl multiplicity from a 
Poisson distribution with expectation value 2.0, so 0 is possible, but the long-term 
average multiplicity is 2.0. For R calculation, when multiplicity of any given hkl is 
less than 2 it is skipped. What I end up with after 120,000 trials is a distribution of 
values for each R factor.  See attached graph.
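A rough sketch of the kind of simulation described above, using only the Python standard library. It is not James's script: the function names, the random seed, the reduced trial count (2,000 instead of 120,000) and the reading of "true I/sigma = 1" as "mean true intensity 1 with Gaussian sigma 1 per observation" are all assumptions made for illustration.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson-distributed integer (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def one_trial(rng, n_hkl=100, mean_mult=2.0, sigma=1.0):
    """Simulate one resolution bin and return (Rmerge, Rmeas)."""
    num_merge = num_meas = denom = 0.0
    for _ in range(n_hkl):
        true_i = rng.expovariate(1.0)      # exponential (Wilson), mean 1
        n = poisson(rng, mean_mult)        # per-hkl multiplicity
        if n < 2:
            continue                       # skipped, as in the description
        obs = [rng.gauss(true_i, sigma) for _ in range(n)]
        mean_obs = sum(obs) / n
        abs_dev = sum(abs(x - mean_obs) for x in obs)
        num_merge += abs_dev
        num_meas += math.sqrt(n / (n - 1)) * abs_dev
        denom += sum(obs)
    return num_merge / denom, num_meas / denom

def quartiles(values):
    """Return the approximate 25%, 50% and 75% points by sorting."""
    s = sorted(values)
    return s[len(s) // 4], s[len(s) // 2], s[(3 * len(s)) // 4]

rng = random.Random(2017)                  # fixed seed, arbitrary choice
trials = [one_trial(rng) for _ in range(2000)]
rmerge_q = quartiles([t[0] for t in trials])
rmeas_q = quartiles([t[1] for t in trials])
print("Rmerge quartiles:", [f"{q:.3f}" for q in rmerge_q])
print("Rmeas  quartiles:", [f"{q:.3f}" for q in rmeas_q])
```

The distance between the first and third quartile is the "middle-half range" referred to in the message. The exact numbers depend on the seed and on the modelling assumptions listed above, so they should not be expected to match the quoted figures digit for digit.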

What I hope is readily apparent is that the distribution of Rmerge
values is taller and sharper than that of the Rmeas values.  The most likely Rmeas is 80% 
and that of Rmerge is 64.6%.  This is expected, of course.  But what I hope to impress 
upon you is that the most likely value is not generally the one that you will get! The 
distribution has a width.  Specifically, Rmeas could be as low as 40%, or as high as 
209%, depending on the trial.  Half of the trial results fall between 71.4% and 90.3%, 
a range of 19 percentage points.  Rmerge has a middle-half range from 57.6% to 72.9% 
(15.3 percentage points).  This range of possible values of Rmerge or Rmeas from data 
with the same intrinsic quality is what I mean when I say "numerical 
instability".  Each and every trial had the same true I/sigma and multiplicity, and 
yet the R factors I get vary depending on the trial.  Unfortunately for most of us with 
real data, you only ever get one trial, and you can't predict which Rmeas or Rmerge 
you'll get.

My point here is that R statistics in general are not comparable from 
experiment to experiment when you are looking at data with low average 
intensity and low multiplicity, and it appears that Rmeas is less stable than 
Rmerge.  Not by much, mind you, but still jumps around more.

Hope that is clearer?

Note that in no way am I suggesting that low-multiplicity is the right way to 
collect data.  Far from it.  Especially with modern detectors that have 
negligible read-out noise. But when micro crystals only give off a handful of 
photons each before they die, low multiplicity might be all you have.

-James Holton
MAD Scientist



On 7/7/2017 2:33 PM, Edward A. Berry wrote:

I think the confusion here is that the "multiplicity correction"
is applied on each reflection, where it will be an integer 2 or
greater (can't estimate variance with only one measurement). You
can only correct in an approximate way using the average
multiplicity of the dataset, since it would depend on the distribution of 
multiplicity over the reflections.

And the correction is for r-merge. You don't need to apply a
correction to R-meas.
R-meas is a redundancy-independent best estimate of the variance.
Whatever you would have used R-merge for (hopefully taking
allowance for the multiplicity) you can use R-meas and not worry about 
multiplicity.
Again, what information does R-merge provide that R-meas does not
provide in a more accurate way?

According to the Denzo manual, one way to artificially reduce
R-merge is to include reflections with only one measurement
(averaging in a lot of zeros always helps bring an average
down), and they say there were actually some programs that did
that. However I'm quite sure none of the ones we rely on today do that.

On 07/07/2017 03:12 PM, Kay Diederichs wrote:

James,

I cannot follow you. "n approaches 1" can only mean n = 2 because n is integer. 
And for n=2 the sqrt(n/(n-1)) factor is well-defined. For n=1, neither contributions to 
Rmeas nor Rmerge nor to any other precision indicator can be calculated anyway, because 
there's nothing this measurement can be compared against.
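To make the per-reflection nature of the correction concrete, here is a minimal sketch of how Rmerge, Rmeas and Rpim could be accumulated from unmerged intensities, using the usual textbook definitions. The function name and the toy data are hypothetical; real scaling programs also handle weighting, outlier rejection and anomalous pairs, none of which is shown here.

```python
import math

def merging_r_factors(observations):
    """Return (Rmerge, Rmeas, Rpim) for a dict mapping hkl -> list of intensities.

    Reflections measured only once are skipped, because a single
    measurement has nothing to be compared against.
    """
    sum_merge = sum_meas = sum_pim = sum_i = 0.0
    for hkl, intensities in observations.items():
        n = len(intensities)
        if n < 2:
            continue
        mean_i = sum(intensities) / n
        abs_dev = sum(abs(i - mean_i) for i in intensities)
        sum_merge += abs_dev
        sum_meas += math.sqrt(n / (n - 1)) * abs_dev   # integer n >= 2 only
        sum_pim += math.sqrt(1 / (n - 1)) * abs_dev    # precision of the mean
        sum_i += sum(intensities)
    return sum_merge / sum_i, sum_meas / sum_i, sum_pim / sum_i

# Hypothetical toy data: three unique reflections, one of them measured once.
toy = {
    (1, 0, 0): [10.0, 12.0],
    (0, 1, 0): [20.0],
    (0, 0, 1): [5.0, 7.0, 6.0],
}
rmerge, rmeas, rpim = merging_r_factors(toy)
print(f"Rmerge={rmerge:.4f}  Rmeas={rmeas:.4f}  Rpim={rpim:.4f}")
```

Because n is the count for each individual reflection, the factor sqrt(n/(n-1)) is never evaluated at n=1 in this scheme; the singly measured reflection simply does not contribute.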

just my 2 cents,

Kay

On Fri, 7 Jul 2017 10:57:17 -0700, James Holton <[email protected]> 
wrote:

I happen to be one of those people who think Rmerge is a very
useful statistic.  Not as a method of evaluating the resolution
limit, which is mathematically ridiculous, but for a host of
other important things, like evaluating the performance of data
collection equipment, and evaluating the isomorphism of different crystals, to 
name a few.

I like Rmerge because it is a simple statistic that has a
simple formula and has not undergone any "corrections".
Corrections increase complexity, and complexity opens the door
to manipulation by the desperate and/or misguided.  For
example, overzealous outlier rejection is a common way to abuse
R factors, and it is far too often swept under the rug,
sometimes without the user even knowing about it. This is
especially problematic when working in a regime where the statistic of interest 
is unstable, and for R factors this is low intensity data.
Rejecting just the right "outliers" can make any R factor look
a lot better.  Why would Rmeas be any more unstable than
Rmerge? Look at the formula. There is an "n-1" in the
denominator, where n is the multiplicity.  So, what happens
when n approaches 1 ? What happens when n=1? This is not to say
Rmerge is better than Rmeas. In fact, I believe the latter is
generally superior to the first, unless you are working near n
= 1. The sqrt(n/(n-1)) is trying to correct for bias in the R
statistic, but fighting one infinity with another infinity is a dangerous game.

My point is that neither Rmerge nor Rmeas are easily
interpreted without knowing the multiplicity.  If you see Rmeas
= 10% and the multiplicity is 10, then you know what that
means.  Same for Rmerge, since at n=10 both stats have nearly
the same value.  But if you have Rmeas = 45% and multiplicity =
1.05, what does that mean?  Rmeas will be only 33% if the
multiplicity is rounded up to 1.1. This is what I mean by
"numerical instability", the value of the R statistic itself
becomes sensitive to small amounts of noise, and behaves more
and more like a random number generator. And if you have Rmeas
= 33% and no indication of multiplicity, it is hard to know
what is going on.  I personally am a lot more comfortable
seeing qualitative agreement between Rmerge and Rmeas, because that means the 
numerical instability of the multiplicity correction didn't mess anything up.
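The 45% versus 33% example above can be reproduced if n in sqrt(n/(n-1)) is taken to be the average multiplicity of the data set. That reading is an assumption made here only to follow the arithmetic; other messages in this thread point out that the factor is applied per reflection, with integer n of 2 or more.

```python
import math

def correction(n):
    """sqrt(n/(n-1)), evaluated here at a non-integer *average* multiplicity."""
    return math.sqrt(n / (n - 1))

rmeas_at_105 = 0.45                                # value quoted above
implied_rmerge = rmeas_at_105 / correction(1.05)   # uncorrected statistic
rmeas_at_110 = implied_rmerge * correction(1.10)

print(f"correction at n=1.05 : {correction(1.05):.2f}")
print(f"correction at n=1.10 : {correction(1.10):.2f}")
print(f"implied Rmerge       : {implied_rmerge:.3f}")
print(f"Rmeas at n=1.10      : {rmeas_at_110:.3f}")
```

Under that assumption the factor drops from about 4.6 to about 3.3 between n=1.05 and n=1.10, which is why such a small change in multiplicity moves the corrected statistic so much.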

Of course, when the intensity is weak R statistics in general
are not useful.  Both Rmeas and Rmerge have the sum of all
intensities in the denominator, so when the bin-wide sum
approaches zero you have another infinity to contend with.
This one starts to rear its ugly head once I/sigma drops below
about 3, and this is why our ancestors always applied a sigma
cutoff before computing an R factor. Our small-molecule
colleagues still do this!  They call it "R1".  And it is an
excellent indicator of the overall relative error.  The
relative error in the outermost bin is not meaningful, and strangely enough 
nobody ever reported the outer-resolution Rmerge before 1995.

For weak signals, Correlation Coefficients are better, but for
strong signals CC pegs out at >95%, making it harder to see relative errors.
I/sigma is what we'd like to know, but the value of "sigma" is
still prone to manipulation by not just outlier rejection, but
massaging the so-called "error model".  Suffice it to say,
crystallographic data contain more than one type of error.
Some sources are important for weak spots, others are important
for strong spots, and still others are only apparent in the
mid-range.  Some sources of error are only important at low
multiplicity, and others only manifest at high multiplicity.
There is no single number that can be used to evaluate all aspects of data 
quality.

So, I remain a champion of reporting Rmerge.  Not in the
high-angle bin, because that is essentially a random number,
but overall Rmerge and low-angle-bin Rmerge next to
multiplicity, Rmeas, CC1/2 and other statistics is the only way
you can glean enough information about where the errors are
coming from in the data.  Rmeas is a useful addition because it
helps us correct for multiplicity without having to do math in
our head.  Users generally thank you for that. Rmerge, however,
has served us well for more than half a century, and I believe
Uli Arndt knew what he was doing.  I hope we all know enough
about history to realize that future generations seldom thank their ancestors for 
"protecting" them from information.

-James Holton
MAD Scientist


On 7/5/2017 10:36 AM, Graeme Winter wrote:

Frank,

you are asking me to remove features that I like, so I would feel that the 
challenge is for you to prove that this is harmful however:

    - at the minimum, I find it a useful check sum that the stats are 
internally consistent (though I interpret it for lots of other reasons too)
    - it is faulty I agree, but (with caveats) still useful
IMHO

Sorry for being terse, but I remain to be convinced that
removing it increases the amount of information

CC’ing BB as requested

Best wishes Graeme

On 5 Jul 2017, at 17:17, Frank von Delft <[email protected]> wrote:

You keep not answering the challenge.

It's really simple:  what information does Rmerge provide that Rmeas doesn't?

(If you answer, email to the BB.)


On 05/07/2017 16:04, [email protected] wrote:

Dear Frank,

You are forcefully arguing essentially that others are wrong if we feel an 
existing statistic continues to be useful, and instead insist that it be 
outlawed so that we may not make use of it, just in case someone misinterprets 
it.

Very well

I do however express disquiet that we as software developers feel browbeaten to 
remove the output we find useful because “the community” feel that it is 
obsolete.

I feel that Jacob’s short story on this thread illustrates that educating the 
next generation of crystallographers to understand what all of the numbers mean 
is critical, and that a numerological approach of trying to optimise any one 
statistic is essentially doomed. Precisely the same argument could be made for 
people cutting the “resolution” at the wrong place in order to improve the 
average I/sig(I) of the data set.

Denying access to information is not a solution to misinterpretation, from 
where I am sat, however I acknowledge that other points of view exist.

Best wishes Graeme


On 5 Jul 2017, at 12:11, Frank von Delft 
<[email protected]<mailto:[email protected]>> wrote:


Graeme, Andrew

Jacob is not arguing against an R-based statistic;  he's pointing out that leaving 
out the multiplicity-weighting is prehistoric (Diederichs & Karplus published 
it 20 years ago!).

So indeed:   Rmerge, Rpim and I/sigI give different information.  As you say.

But no:   Rmerge and Rmeas and Rcryst do NOT give different information.  
Except:

     * Rmerge is a (potentially) misleading version of Rmeas.

     * Rcryst and Rmerge and Rsym are terms that no longer have significance in 
the single cryo-dataset world.

phx.



On 05/07/2017 09:43, Andrew Leslie wrote:

I would like to support Graeme in his wish to retain Rmerge in Table 1, 
essentially for exactly the same reasons.

I also strongly support Francis Reyes' comment about the usefulness of Rmerge at 
low resolution, and I would add to his list that it can also, in some 
circumstances, be more indicative of the wrong choice of symmetry (too high) 
than the statistics that come from POINTLESS (excellent though that program 
is!).

Andrew
On 5 Jul 2017, at 05:44, Graeme Winter 
<[email protected]<mailto:[email protected]>> wrote:

Hi Jacob

Yes, I got this - and I appreciate the benefit of Rmeas for dealing with 
measuring agreement for small-multiplicity observations. Having this *as well* 
is very useful and I agree Rmeas / Rpim / CC-half should be the primary 
“quality” statistics.

However, you asked if there is any reason to *keep* rather
than *eliminate* Rmerge, and I offered one :o)

I do not see what harm there is reporting Rmerge, even if it is just used in 
the inner shell or just used to capture a flavour of the data set overall. I 
also appreciate that Rmeas converges to the same value for large multiplicity 
i.e.:

                                         Overall  InnerShell  OuterShell
Low resolution limit                       39.02       39.02        1.39
High resolution limit                       1.35        6.04        1.35

Rmerge  (within I+/I-)                     0.080       0.057       2.871
Rmerge  (all I+ and I-)                    0.081       0.059       2.922
Rmeas (within I+/I-)                       0.081       0.058       2.940
Rmeas (all I+ & I-)                        0.082       0.059       2.958
Rpim (within I+/I-)                        0.013       0.009       0.628
Rpim (all I+ & I-)                         0.009       0.007       0.453
Rmerge in top intensity bin                0.050           -           -
Total number of observations             1265512       16212       53490
Total number unique                        17515         224        1280
Mean((I)/sd(I))                             29.7       104.3         1.5
Mn(I) half-set correlation CC(1/2)         1.000       1.000       0.778
Completeness                               100.0        99.7       100.0
Multiplicity                                72.3        72.4        41.8

Anomalous completeness                     100.0       100.0       100.0
Anomalous multiplicity                      37.2        42.7        21.0
DelAnom correlation between half-sets      0.497       0.766      -0.026
Mid-Slope of Anom Normal Probability       1.039           -           -

(this is a good case for Rpim & CC-half as resolution limit
criteria)
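One way to use the table as an internal consistency check is to compare Rpim with Rmeas divided by the square root of the multiplicity. Per reflection, the Rpim weight sqrt(1/(n-1)) is the Rmeas weight sqrt(n/(n-1)) divided by sqrt(n); applying this with the shell-average multiplicity is only an approximation, and the dictionary layout below is purely illustrative.

```python
import math

# (Rmeas, Rpim, multiplicity) for "all I+ & I-", copied from the table above.
shells = {
    "Overall":    (0.082, 0.009, 72.3),
    "InnerShell": (0.059, 0.007, 72.4),
    "OuterShell": (2.958, 0.453, 41.8),
}

estimates = {}
for name, (rmeas, rpim, mult) in shells.items():
    # Approximate Rpim from Rmeas using the average multiplicity of the shell.
    estimates[name] = rmeas / math.sqrt(mult)
    print(f"{name:<10}  Rmeas/sqrt(n) = {estimates[name]:.3f}"
          f"   reported Rpim = {rpim:.3f}")
```

The estimates land close to the reported Rpim values in all three shells, which is the sort of agreement one would hope to see when the statistics are internally consistent.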

If the statistics you want to use are there & some others
also, what is the pressure to remove them? Surely we want to
educate on how best to interpret the entire table above to
get a fuller picture of the overall quality of the data? My
0th-order request would be to publish the three shells as
above ;o)

Cheers Graeme



On 4 Jul 2017, at 22:09, Keller, Jacob 
<[email protected]<mailto:[email protected]>> wrote:

I suggested replacing Rmerge/sym/cryst with Rmeas, not Rpim. Rmeas is simply (Rmerge * 
sqrt(n/(n-1))) where n is the number of measurements of that reflection. It's merely a way 
of correcting for the multiplicity-related artifact of Rmerge, which is becoming even 
more of a problem with data sets of increasing variability in multiplicity. Consider the 
case of comparing a data set with a multiplicity of 2 versus one of 100: equivalent data 
quality would yield Rmerges diverging by a factor of ~1.4. But this has all been covered 
before in several papers. It can be and is reported in resolution bins, so can be used 
exactly as you say. So, why not "disappear" Rmerge from the software?
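The factor of ~1.4 follows directly from the multiplicity correction sqrt(n/(n-1)). A short check, with a hypothetical helper name, assuming every reflection in each data set has exactly the stated multiplicity:

```python
import math

def rmerge_over_rmeas(n):
    """Ratio Rmerge/Rmeas when every reflection is measured exactly n times."""
    return math.sqrt((n - 1) / n)

low, high = 2, 100
ratio = rmerge_over_rmeas(high) / rmerge_over_rmeas(low)

print(f"Rmerge/Rmeas at n={low}  : {rmerge_over_rmeas(low):.3f}")
print(f"Rmerge/Rmeas at n={high}: {rmerge_over_rmeas(high):.3f}")
print(f"divergence in Rmerge  : {ratio:.2f}")
```

For the same underlying data quality (same Rmeas), Rmerge at multiplicity 100 is therefore about 1.4 times the Rmerge at multiplicity 2.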

The only reason I could come up with for keeping it is historical reasons or 
comparisons to previous datasets, but anyway those comparisons would be 
confounded by variabilities in multiplicity and a hundred other things, so come 
on, developers, just comment it out!

JPK




-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Tuesday, July 04, 2017 4:37 PM
To: Keller, Jacob <[email protected]>
Cc: [email protected]
Subject: Re: [ccp4bb] Rmergicide Through Programming

Hi Jacob

Unbiased estimate of the true unmerged I/sig(I) of your data
(I find this particularly useful at low resolution) i.e. if
your inner shell Rmerge is 10% your data agree very poorly;
if 2% says your data agree very well provided you have
sensible multiplicity… obviously depends on sensible
interpretation. Rpim hides this (though tells you more about
the quality of average measurement)

Essentially, for I/sig(I) you can (by and large) adjust your sig(I) values 
however you like if you were so inclined. You can only adjust Rmerge by 
excluding measurements.
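The link between Rmerge and the unmerged I/sig(I) can be made roughly quantitative. The sketch below is not a formula from this thread: it assumes Gaussian errors and high multiplicity, so that the merged mean is close to the true intensity and the mean absolute deviation of an observation is sigma * sqrt(2/pi), about 0.8 sigma. The function name is hypothetical.

```python
import math

# Mean absolute deviation of a Gaussian is sigma * sqrt(2/pi), about 0.8 sigma.
MAD_FACTOR = math.sqrt(2 / math.pi)

def approx_unmerged_i_over_sigma(rmerge):
    """Rough I/sigma of individual observations implied by a shell Rmerge.

    Assumes Gaussian errors and high multiplicity, so that the merged mean
    is close to the true intensity. An approximation, not an exact relation.
    """
    return MAD_FACTOR / rmerge

for r in (0.02, 0.10):
    value = approx_unmerged_i_over_sigma(r)
    print(f"Rmerge = {r:.0%}  ->  unmerged I/sigma of roughly {value:.0f}")
```

On these assumptions an inner-shell Rmerge of 2% corresponds to individual observations with I/sigma of around 40, and 10% to around 8, independent of whatever error model was used to produce the reported sig(I) values.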

I would therefore defend that - amongst the other stats you
enumerate below - it still has a place

Cheers Graeme

On 4 Jul 2017, at 14:10, Keller, Jacob 
<[email protected]<mailto:[email protected]>> wrote:

Rmerge does contain information which complements the others.

What information? I was trying to think of a counterargument to what I 
proposed, but could not think of a reason in the world to keep reporting it.

JPK


On 4 Jul 2017, at 12:00, Keller, Jacob 
<[email protected]<mailto:[email protected]><mailto:[email protected]>>
 wrote:

Dear Crystallographers,

Having been repeatedly chagrinned about the continued use and reporting of Rmerge rather 
than Rmeas or similar, I thought of a potential way to promote the change: what if 
merging programs would completely omit Rmerge/cryst/sym? Is there some reason to continue 
to report these stats, or are they just grandfathered into the software? I doubt that any 
journal or crystallographer would insist on reporting Rmerge per se. So, I wonder what 
developers would think about commenting out a few lines of their code, seeing what 
happens? Maybe a comment to the effect of "Rmerge is now deprecated; use Rmeas" 
would be useful as well. Would something catastrophic happen?

All the best,

Jacob Keller

*******************************************
Jacob Pearson Keller, PhD
Research Scientist
HHMI Janelia Research Campus / Looger lab
Phone: (571)209-4000 x3159
Email: [email protected]
*******************************************



--
John Berrisford
PDBe
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD UK
Tel: +44 1223 492529

http://www.pdbe.org
http://www.facebook.com/proteindatabank
http://twitter.com/PDBeurope
