Hi David,

Why not do all those things with Rwork? It is much less noisy than Rfree. Have you ever seen a case in such analysis where Rwork didn't tell you the same thing Rfree did?  If so, did you believe the difference?

Once, when I was playing with lossy image compression, I found that if I picked just the right compression ratio I could get a slightly better Rfree. But that is not something I'd recommend as a good idea.

-James Holton
MAD Scientist

On 11/1/2021 2:22 AM, David Waterman wrote:
Hi James,

What you wrote makes lots of sense. I had not heard about Rsleep, so that looks like interesting reading, thanks.

I have often used Rfree as a simple tool to compare two protocols. If I am not actually optimising against Rfree but just using it for a one-off comparison then that is okay, right?

Let's say I have two data processing protocols, A and B. Between these I might be exploring some difference in options within one data processing program, perhaps different geometry refinement parameters, or scaling options. I expect the A and B data sets to be quite similar, but I would like to evaluate which protocol was "better", and I want to do this quickly, ideally by looking at a single number. I don't like I/sigI because I don't trust the sigmas, CC1/2 is often noisy, and I'm totally sworn off merging R statistics for these purposes. I tend to use Rfree as an easily available metric, independent of the data processing program and the merging stats. It also allows a comparison of A and B in terms of the "product" of crystallography, namely the refined structure. In this I am lucky because I'm not trying to solve a structure. I may be looking at lysozyme or proteinase K: something where I can download a pretty good approximation to the truth from the PDB.

So, what I do is process the data by A and process by B, ensure the data sets have the same free set, then refine to convergence (or at least, a lot of cycles) starting from a PDB structure. I then evaluate A vs B in terms of Rfree, though without an error bar on Rfree I don't read too much into small differences.
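
As a rough sanity check on whether a difference is bigger than the
noise, I sometimes use the rule of thumb that the statistical
uncertainty on Rfree is roughly Rfree/sqrt(Nfree). A minimal sketch in
Python (the numbers below are invented, not from a real comparison):

    import math

    def rfree_sigma(rfree, n_free):
        """Rule-of-thumb standard error on Rfree from the free-set size."""
        return rfree / math.sqrt(n_free)

    # Hypothetical values for protocols A and B, same free set.
    rfree_a, rfree_b, n_free = 0.215, 0.209, 1500
    sigma = math.hypot(rfree_sigma(rfree_a, n_free),
                       rfree_sigma(rfree_b, n_free))
    print(f"Rfree(A) - Rfree(B) = {rfree_a - rfree_b:.4f} +/- {sigma:.4f}")

If the difference does not clear that combined error, I treat A and B
as indistinguishable.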

Does this procedure seem sound? Perhaps it could be improved by randomly jiggling the atoms in the starting structure, in case the PDB deposition had already followed an A- or B-like protocol. Perhaps the whole approach is suspect. Certainly I wouldn't want to generalise by saying that A or B is better in all cases, but I do want to find a way to assess the various tweaks I can try in data processing for a single case.
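
For concreteness, by "jiggling" I just mean adding small random shifts
to the coordinates before refinement, along these lines (a rough
sketch; the 0.2 Angstrom amplitude and the file names are only
placeholders):

    import random

    def jiggle_pdb(pdb_in, pdb_out, amplitude=0.2):
        """Add uniform random shifts (Angstrom) to ATOM/HETATM coordinates."""
        with open(pdb_in) as fin, open(pdb_out, "w") as fout:
            for line in fin:
                if line.startswith(("ATOM  ", "HETATM")):
                    x = float(line[30:38]) + random.uniform(-amplitude, amplitude)
                    y = float(line[38:46]) + random.uniform(-amplitude, amplitude)
                    z = float(line[46:54]) + random.uniform(-amplitude, amplitude)
                    line = line[:30] + f"{x:8.3f}{y:8.3f}{z:8.3f}" + line[54:]
                fout.write(line)

    jiggle_pdb("start.pdb", "start_jiggled.pdb")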

Any thoughts? I appreciate the wisdom of the BB here.

Cheers

-- David


On Fri, 29 Oct 2021 at 15:50, James Holton <jmhol...@lbl.gov> wrote:


    Well, of all the possible metrics you could use to assess data
    quality, Rfree is probably the worst one.  This is because it is a
    cross-validation metric, and cross-validations don't work if you
    use them as an optimization target. You can try, and might even
    make a little headway, but then your free set is burnt. If you
    have a third set of observations, as suggested for Rsleep
    (doi:10.1107/S0907444907033458), then you have a chance at another
    round of cross-validation. Crystallographers don't usually do
    this, but it has become standard practice in machine learning
    (training=Rwork, validation=Rfree and testing=Rsleep).
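
    In practice that just means setting aside a third, never-touched
    subset of reflections alongside the usual work and free sets. A
    minimal sketch of such a three-way split (the fractions here are
    arbitrary, not a recommendation):

        import random

        def assign_sets(hkls, free_frac=0.05, sleep_frac=0.05, seed=0):
            """Randomly label each reflection as work, free, or sleep."""
            rng = random.Random(seed)
            labels = {}
            for hkl in hkls:
                r = rng.random()
                if r < sleep_frac:
                    labels[hkl] = "sleep"
                elif r < sleep_frac + free_frac:
                    labels[hkl] = "free"
                else:
                    labels[hkl] = "work"
            return labels

    The sleep set is only ever looked at once, at the very end.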

    So, unless you have an Rsleep set, any time you contemplate doing
    a bunch of random things and picking the best Rfree ... don't. 
    Just don't.  There madness lies.

    What happens after doing this is that you will initially be happy
    about your lower Rfree, but everything you do after that will make
    it go up more than it would have had you not performed your Rfree
    optimization. This is because the changes in the data that made
    Rfree randomly better were actually noise, and as the structure
    becomes more correct it will move away from that noise. It's
    always better to optimize on something else, and then check your
    Rfree as infrequently as possible. Remember that it is the control
    for your experiment. Never mix your positive control with your
    sample.
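
    If you want to convince yourself of this, a toy numerical
    experiment makes the point. Below, every "protocol" is equally
    good and the observed Rfree differs only by noise, so the winner
    you pick by Rfree is biased low, and an untouched set (the Rsleep
    idea) would not share in that luck. All the numbers are invented:

        import random

        random.seed(1)
        true_rfree, noise, trials = 0.230, 0.003, 20

        # Twenty equally good "protocols"; Rfree differs only by noise.
        observed = [random.gauss(true_rfree, noise) for _ in range(trials)]

        print(f"best observed Rfree: {min(observed):.4f}")  # biased low
        print(f"true quality:        {true_rfree:.4f}")     # what an untouched set sees on average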

    As for the best metric to assess data quality?  Well, what are you
    doing with the data? There are always compromises in data
    processing and reduction that favor one application over another.
    If this is an "I just want the structure" project, then score on
    the resolution where CC1/2 hits your favorite value. For some that
    is 0.5, for others 0.3. I tend to use 0.0 so I can cut the data
    later without re-processing. Whatever you do, just be consistent.
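
    By "score on the resolution where CC1/2 hits your favorite value"
    I mean nothing fancier than reading the per-shell table, e.g. (the
    shell values below are invented):

        # (d-spacing in Angstrom, CC1/2) per shell, highest resolution last.
        shells = [(3.0, 0.99), (2.5, 0.97), (2.2, 0.90), (2.0, 0.70),
                  (1.9, 0.45), (1.8, 0.25), (1.7, 0.05)]

        def cutoff_resolution(shells, cc_threshold=0.3):
            """d-spacing of the highest-resolution shell with CC1/2 >= threshold."""
            best = None
            for d, cc_half in shells:
                if cc_half >= cc_threshold:
                    best = d
            return best

        print(cutoff_resolution(shells, 0.3))   # -> 1.9

    Score each processing protocol by that number, and keep the
    threshold the same across all of them.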

    If it's for anomalous, score on CCanom or, if that's too noisy, on
    Imean/sigma in the lowest-angle resolution or highest-intensity
    bin. This is because for anomalous you want to minimize relative
    error. The end-all-be-all of anomalous signal strength is the
    phased anomalous difference Fourier. You need phases to do one,
    but if you have a structure, just omit an anomalous scatterer of
    interest, refine to convergence, and then measure the peak height
    at the position of the omitted anomalous atom.  Instructions for
    doing anomalous refinement in refmac5 are here:
    
https://www2.mrc-lmb.cam.ac.uk/groups/murshudov/content/refmac/refmac_keywords.html
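
    Measuring the peak height itself can be done with whatever map
    tool you like; one possible sketch uses gemmi (the map file name
    and the site coordinates are placeholders, and the setup() call
    may differ between gemmi versions):

        import gemmi
        import numpy as np

        # Phased anomalous difference map written after the omit refinement
        # (hypothetical file name).
        ccp4 = gemmi.read_ccp4_map("anom_diff_omit.ccp4")
        ccp4.setup(float("nan"))   # expand the grid to the full unit cell
        grid = ccp4.grid

        # Orthogonal coordinates (Angstrom) of the omitted anomalous scatterer.
        site = gemmi.Position(12.3, 45.6, 7.8)
        peak = grid.interpolate_value(site)

        values = np.array(grid, copy=False)
        print(f"peak height: {peak / np.nanstd(values):.1f} sigma")

    The number to compare between processing protocols is that peak
    height in sigma units.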

    If you're looking for a ligand you probably want isomorphism, and
    in that case refining against a reference structure and looking
    for a low Rwork is not a bad strategy. This will tend to select
    for crystals containing a molecule that looks like the one you are
    refining. But be careful! If the reference is an apo structure,
    your ligand-bound crystals will have a higher Rwork due to the
    very difference density you are looking for.

    But if it's the same data just being processed in different ways,
    first make a choice about what you are interested in, and then
    optimize on that.  Just don't optimize on Rfree!

    -James Holton
    MAD Scientist


    On 10/27/2021 8:44 AM, Murpholino Peligro wrote:
    Let's say I ran autoproc with different combinations of options
    for a specific dataset, producing dozens of different (but not so
    different) mtz files...
    Then I ran phenix.refine with the same options for the same
    structure, but with all of my mtz zoo.
    What would be the best metric to say "hey, this combo works the
    best!"?
    R-free?
    Thanks

    M. Peligro
