Hi, whilst I completely concur with James that Rfree is not a suitable
metric for this purpose, for all the reasons he mentioned, it's not clear to
me that Rwork is much better.  If you really want to go down that route, a
somewhat better choice would be Rall, i.e. R computed over all reflections,
ignoring the free-R flags, though I realise that some programs use the test
set to obtain unbiased sigmaA values, so that may not be practical.  IMO
Rall is hardly any better for this purpose anyway.

None of the refinement R values are truly suitable for this purpose,
basically because they are strictly "model selection" metrics (e.g. see
https://en.wikipedia.org/wiki/Model_selection?wprov=sfla1), i.e. metrics
used to select a mathematical model (a specific parameterisation of the
atomic model, such as overall B, isotropic Bs, TLS, anisotropic tensors,
etc.) from a set of candidate models, on the crucial assumption that
identical data are used in each comparison.  Obviously, if both the data
and the model are changed, one cannot sensibly perform the comparison.
Also, because of the danger of overfitting, one would not use Rwork or Rall
even for model selection, but rather Rfree in cross-validation.

Suppose you had to compare two datasets differing only in their
high-resolution cut-offs.  Rwork, Rfree and Rall will inevitably have
higher values at the high-d* end, so applying a cut-off there makes all the
overall R values smaller.  Using any R value as a data-quality metric will
therefore tend to select the lowest-resolution dataset from your candidate
set, which may well give a lower-quality map: not what you want!
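
To make that concrete with a toy example (made-up numbers, just to show the
arithmetic): the overall R is a weighted average of the per-shell R values,
with each shell weighted by its sum of |Fobs|, so simply dropping the noisy
outer shell lowers the overall number without the data getting any better.

    # Illustrative per-shell R factors, ordered from low to high d*,
    # and the corresponding sum of |Fobs| in each shell (the weights).
    shell_r = [0.15, 0.18, 0.22, 0.30, 0.45]
    shell_sum_fobs = [5000.0, 3000.0, 2000.0, 1200.0, 800.0]

    def overall_r(r, w):
        # R_overall = sum_s (R_s * sum|Fobs|_s) / sum_s (sum|Fobs|_s)
        return sum(ri * wi for ri, wi in zip(r, w)) / sum(w)

    print(overall_r(shell_r, shell_sum_fobs))            # ~0.204 with all shells
    print(overall_r(shell_r[:-1], shell_sum_fobs[:-1]))  # ~0.187 with the outer shell cut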

I was faced with exactly this problem when designing an automated pipeline
to accept large numbers of fragment-screening datasets coming from the
synchrotrons, which tend to output several datasets for each crystal,
processed in different ways.  If only the synchrotrons made the decision
for me and gave me just "the best" dataset (either unmerged, or scaled and
merged) for each crystal!  So I needed a quick-and-dirty solution, and
since no atomic model was available at that stage of the processing,
refinement R values were out of the question as a metric anyway.

All I did as a quick-and-dirty solution was to take the highest-resolution
dataset from each crystal as the "best" one, after applying appropriate
standard cut-offs based on mean I/sd(I), completeness and Rmeas.  So
basically I was using resolution as the data-quality metric, which makes a
lot of sense to me.  However, given a refined atomic model, the
slow-and-clean method would clearly be to examine the density, as others
have suggested.

Cheers

-- Ian

On Wed, 3 Nov 2021, 00:05 Murpholino Peligro, <murpholi...@gmail.com> wrote:

> That's exactly what I am doing...
> citing David...
>
> "I expect the A and B data sets to be quite similar, but I would like to
> evaluate which protocol was "better", and I want to do this quickly,
> ideally looking at a single number."
>
> and
>
> "I do want to find a way to assess the various tweaks I can try in data
> processing for a single case"
>
> Why not do all those things with Rwork?
> I thought that comparing R-free rather than R-work would be easier,
> because last week the structure was dehydrated, so the refinement program
> added "strong waters": with a thousand or so extra reflections I could
> have a dozen or so extra waters, and the difference in R-work between
> protocols due to those extra waters was going to be a bit more difficult
> to compare. I now have the final structure, so I could very well compare
> R-work by doing another round of refinement, maybe randomizing the ADPs at
> the beginning or something.
>
> Thanks a lot.
>
> On Mon, 1 Nov 2021 at 03:22, David Waterman (dgwater...@gmail.com) wrote:
>
>> Hi James,
>>
>> What you wrote makes lots of sense. I had not heard about Rsleep, so that
>> looks like interesting reading, thanks.
>>
>> I have often used Rfree as a simple tool to compare two protocols. If I
>> am not actually optimising against Rfree but just using it for a one-off
>> comparison then that is okay, right?
>>
>> Let's say I have two data processing protocols, A and B. Between these I
>> might be exploring some difference in options within one data processing
>> program, perhaps different geometry refinement parameters, or scaling
>> options. I expect the A and B data sets to be quite similar, but I would
>> like to evaluate which protocol was "better", and I want to do this
>> quickly, ideally looking at a single number. I don't like I/sigI because I
>> don't trust the sigmas, CC1/2 is often noisy, and I'm totally sworn off
>> merging R statistics for these purposes. I tend to use Rfree as an
>> easily-available metric, independent from the data processing program and
>> the merging stats. It also allows a comparison of A and B in terms of the
>> "product" of crystallography, namely the refined structure. In this I am
>> lucky because I'm not trying to solve a structure. I may be looking at
>> lysozyme or proteinase K: something where I can download a pretty good
>> approximation to the truth from the PDB.
>>
>> So, what I do is process the data by A and process by B, ensure the data
>> sets have the same free set, then refine to convergence (or at least, a lot
>> of cycles) starting from a PDB structure. I then evaluate A vs B in terms
>> of Rfree, though without an error bar on Rfree I don't read too much into
>> small differences.
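>>
>> A quick way to check that the two MTZ files really do carry the same free
>> set is something like the following read-only sketch with gemmi (the file
>> names and column label are assumptions, and the exact calls may differ
>> between gemmi versions):
>>
>>     import gemmi
>>
>>     def free_flags(path, label="FreeR_flag"):
>>         # Map (h, k, l) -> free-R flag for one merged MTZ file.
>>         mtz = gemmi.read_mtz_file(path)
>>         h, k, l, f = (mtz.column_with_label(x).array for x in ("H", "K", "L", label))
>>         return {(int(a), int(b), int(c)): int(v)
>>                 for a, b, c, v in zip(h, k, l, f)}
>>
>>     a = free_flags("protocol_A.mtz")
>>     b = free_flags("protocol_B.mtz")
>>     common = set(a) & set(b)
>>     differing = sum(1 for hkl in common if a[hkl] != b[hkl])
>>     print(len(common), "common reflections,", differing, "with differing flags")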
>>
>> Does this procedure seem sound? Perhaps it could be improved by randomly
>> jiggling the atoms in the starting structure, in case the PDB deposition
>> had already followed an A- or B-like protocol. Perhaps the whole approach
>> is suspect. Certainly I wouldn't want to generalise by saying that A or B
>> is better in all cases, but I do want to find a way to assess the various
>> tweaks I can try in data processing for a single case.
>>
>> Any thoughts? I appreciate the wisdom of the BB here.
>>
>> Cheers
>>
>> -- David
>>
>>
>> On Fri, 29 Oct 2021 at 15:50, James Holton <jmhol...@lbl.gov> wrote:
>>
>>>
>>> Well, of all the possible metrics you could use to assess data quality
>>> Rfree is probably the worst one.  This is because it is a cross-validation
>>> metric, and cross-validations don't work if you use them as an optimization
>>> target. You can try, and might even make a little headway, but then your
>>> free set is burnt. If you have a third set of observations, as suggested
>>> for Rsleep (doi:10.1107/S0907444907033458), then you have a chance at
>>> another round of cross-validation. Crystallographers don't usually do this,
>>> but it has become standard practice in machine learning (training=Rwork,
>>> validation=Rfree and testing=Rsleep).
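>>>
>>> For anyone who has not met the third set before, the bookkeeping is just
>>> a three-way split of the reflections instead of the usual two-way one. A
>>> toy sketch (the fractions and the random selection scheme are only
>>> illustrative; real programs store these flags in the reflection file):
>>>
>>>     import random
>>>     random.seed(0)
>>>
>>>     def assign_set(free_frac=0.05, sleep_frac=0.05):
>>>         # Randomly place a reflection in the work, free or sleep set.
>>>         r = random.random()
>>>         if r < sleep_frac:
>>>             return "sleep"
>>>         if r < sleep_frac + free_frac:
>>>             return "free"
>>>         return "work"
>>>
>>>     reflections = [(h, k, l) for h in range(20) for k in range(20) for l in range(20)]
>>>     flags = {hkl: assign_set() for hkl in reflections}
>>>     print({s: sum(1 for v in flags.values() if v == s)
>>>            for s in ("work", "free", "sleep")})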
>>>
>>> So, unless you have an Rsleep set, any time you contemplate doing a
>>> bunch of random things and picking the best Rfree ... don't.  Just don't.
>>> There madness lies.
>>>
>>> What happens after doing this is you will be initially happy about your
>>> lower Rfree, but everything you do after that will make it go up more than
>>> it would have had you not performed your Rfree optimization. This is
>>> because the changes in the data that made Rfree randomly better were
>>> actually noise, and as the structure becomes more correct it will move away
>>> from that noise. It's always better to optimize on something else, and then
>>> check your Rfree as infrequently as possible. Remember it is the control
>>> for your experiment. Never mix your positive control with your sample.
>>>
>>> As for the best metric to assess data quality?  Well, what are you doing
>>> with the data? There are always compromises in data processing and
>>> reduction that favor one application over another.  If this is an "I just
>>> want the structure" project, then score on the resolution where CC1/2 hits
>>> your favorite value. For some that is 0.5, others 0.3. I tend to use 0.0 so
>>> I can cut it later without re-processing.  Whatever you do just make it
>>> consistent.
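>>>
>>> To make "score on the resolution where CC1/2 hits your favorite value"
>>> concrete, here is a rough sketch with an invented binned CC1/2 table (in
>>> practice these numbers come from the merging log):
>>>
>>>     # High-resolution limit of each bin (in Angstrom) and its CC1/2,
>>>     # ordered from low to high resolution; values invented for illustration.
>>>     bins = [(3.2, 0.99), (2.5, 0.97), (2.1, 0.90), (1.9, 0.72),
>>>             (1.8, 0.48), (1.7, 0.25), (1.6, 0.05)]
>>>
>>>     def resolution_at_cc_half(bins, threshold):
>>>         # Return the d_min of the last bin whose CC1/2 is still >= threshold.
>>>         d_best = None
>>>         for d_min, cc_half in bins:
>>>             if cc_half < threshold:
>>>                 break
>>>             d_best = d_min
>>>         return d_best
>>>
>>>     print(resolution_at_cc_half(bins, 0.5))   # 1.9 with these numbers
>>>     print(resolution_at_cc_half(bins, 0.3))   # 1.8 with these numbers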
>>>
>>> If it's for anomalous, score on CCanom, or if that's too noisy, the
>>> Imean/sigma in the lowest-angle resolution or highest-intensity bin. This
>>> is because for anomalous you want to minimize relative error. The
>>> end-all-be-all of anomalous signal strength is the phased anomalous
>>> difference Fourier. You need phases to do one, but if you have a structure
>>> just omit an anomalous scatterer of interest, refine to convergence, and
>>> then measure the peak height at the position of the omitted anomalous
>>> atom.  Instructions for doing anomalous refinement in refmac5 are here:
>>>
>>> https://www2.mrc-lmb.cam.ac.uk/groups/murshudov/content/refmac/refmac_keywords.html
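>>>
>>> For the final step - measuring the peak height at the omitted site -
>>> something along these lines works as a rough sketch, assuming gemmi and
>>> an anomalous difference map already written out as a CCP4 map; the file
>>> name and site coordinates are placeholders, and the exact gemmi calls may
>>> differ between versions:
>>>
>>>     import gemmi
>>>     import numpy as np
>>>
>>>     ccp4 = gemmi.read_ccp4_map("anom_diff.ccp4")
>>>     ccp4.setup(float("nan"))     # expand the grid to cover the whole unit cell
>>>     grid = ccp4.grid
>>>
>>>     # Orthogonal coordinates of the omitted anomalous scatterer, taken
>>>     # from the model before the atom was deleted (placeholder values here).
>>>     site = gemmi.Position(12.3, 45.6, 7.8)
>>>
>>>     value = grid.interpolate_value(site)
>>>     sigma = np.nanstd(np.asarray(grid))   # grid rms as the sigma level
>>>     print("peak height at the omitted site: %.1f sigma" % (value / sigma))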
>>>
>>> If you're looking for a ligand you probably want isomorphism, and in
>>> that case refining with a reference structure and looking for low Rwork is not
>>> a bad strategy. This will tend to select for crystals containing a molecule
>>> that looks like the one you are refining.  But be careful! If it is an apo
>>> structure your ligand-bound crystals will have higher Rwork due to the very
>>> difference density you are looking for.
>>>
>>> But if it's the same data just being processed in different ways, first
>>> make a choice about what you are interested in, and then optimize on
>>> that.  Just don't optimize on Rfree!
>>>
>>> -James Holton
>>> MAD Scientist
>>>
>>>
>>> On 10/27/2021 8:44 AM, Murpholino Peligro wrote:
>>>
>>> Let's say I ran autoproc with different combinations of options for a
>>> specific dataset, producing dozens of different (but not so different) mtz
>>> files...
>>> Then I ran phenix.refine with the same options for the same structure
>>> but with all my mtz zoo.
>>> What would be the best metric to say "hey this combo works the best!"?
>>> R-free?
>>> Thanks
>>>
>>> M. Peligro
>>>
