Hi David,

Why not do all those things with Rwork? It is much less noisy than Rfree. Have you ever seen a case in such analysis where Rwork didn't tell you the same thing Rfree did?  If so, did you believe the difference?

Once, when I was playing with lossy image compression, I found that if I picked just the right compression ratio I could get a slightly better Rfree. But that is not something I'd recommend as a good idea.

-James Holton
MAD Scientist

On 11/1/2021 2:22 AM, David Waterman wrote:
Hi James,

What you wrote makes lots of sense. I had not heard about Rsleep, so that looks like interesting reading, thanks.

I have often used Rfree as a simple tool to compare two protocols. If I am not actually optimising against Rfree but just using it for a one-off comparison then that is okay, right?

Let's say I have two data processing protocols, A and B. Between these I might be exploring some difference in options within one data processing program, perhaps different geometry refinement parameters, or scaling options. I expect the A and B data sets to be quite similar, but I would like to evaluate which protocol was "better", and I want to do this quickly, ideally by looking at a single number. I don't like I/sigI because I don't trust the sigmas, CC1/2 is often noisy, and I'm totally sworn off merging R statistics for these purposes. I tend to use Rfree as an easily available metric, independent of the data processing program and the merging stats. It also allows a comparison of A and B in terms of the "product" of crystallography, namely the refined structure. In this I am lucky because I'm not trying to solve a structure. I may be looking at lysozyme or proteinase K: something where I can download a pretty good approximation to the truth from the PDB.

So, what I do is process the data by A and process by B, ensure the data sets have the same free set, then refine to convergence (or at least, a lot of cycles) starting from a PDB structure. I then evaluate A vs B in terms of Rfree, though without an error bar on Rfree I don't read too much into small differences.
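
As a rough sanity check on whether a difference is bigger than the
noise, I sometimes use the rule of thumb that the statistical
uncertainty on Rfree is roughly Rfree/sqrt(Nfree). A minimal sketch in
Python (the numbers below are invented, not from a real comparison):

    import math

    def rfree_sigma(rfree, n_free):
        """Rule-of-thumb standard error on Rfree from the free-set size."""
        return rfree / math.sqrt(n_free)

    # Hypothetical values for protocols A and B, same free set.
    rfree_a, rfree_b, n_free = 0.215, 0.209, 1500
    sigma = math.hypot(rfree_sigma(rfree_a, n_free),
                       rfree_sigma(rfree_b, n_free))
    print(f"Rfree(A) - Rfree(B) = {rfree_a - rfree_b:.4f} +/- {sigma:.4f}")

If the difference does not clear that combined error, I treat A and B
as indistinguishable.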

Does this procedure seem sound? Perhaps it could be improved by randomly jiggling the atoms in the starting structure, in case the PDB deposition had already followed an A- or B-like protocol. Perhaps the whole approach is suspect. Certainly I wouldn't want to generalise by saying that A or B is better in all cases, but I do want to find a way to assess the various tweaks I can try in data processing for a single case.
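
For concreteness, by "jiggling" I just mean adding small random shifts
to the coordinates before refinement, along these lines (a rough
sketch; the 0.2 Angstrom amplitude and the file names are only
placeholders):

    import random

    def jiggle_pdb(pdb_in, pdb_out, amplitude=0.2):
        """Add uniform random shifts (Angstrom) to ATOM/HETATM coordinates."""
        with open(pdb_in) as fin, open(pdb_out, "w") as fout:
            for line in fin:
                if line.startswith(("ATOM  ", "HETATM")):
                    x = float(line[30:38]) + random.uniform(-amplitude, amplitude)
                    y = float(line[38:46]) + random.uniform(-amplitude, amplitude)
                    z = float(line[46:54]) + random.uniform(-amplitude, amplitude)
                    line = line[:30] + f"{x:8.3f}{y:8.3f}{z:8.3f}" + line[54:]
                fout.write(line)

    jiggle_pdb("start.pdb", "start_jiggled.pdb")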

Any thoughts? I appreciate the wisdom of the BB here.

Cheers

-- David


On Fri, 29 Oct 2021 at 15:50, James Holton <jmhol...@lbl.gov> wrote:


    Well, of all the possible metrics you could use to assess data
    quality, Rfree is probably the worst one.  This is because it is a
    cross-validation metric, and cross-validations don't work if you
    use them as an optimization target. You can try, and might even
    make a little headway, but then your free set is burnt. If you
    have a third set of observations, as suggested for Rsleep
    (doi:10.1107/S0907444907033458), then you have a chance at another
    round of cross-validation. Crystallographers don't usually do
    this, but it has become standard practice in machine learning
    (training=Rwork, validation=Rfree and testing=Rsleep).
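
    In practice that just means setting aside a third, never-touched
    subset of reflections alongside the usual work and free sets. A
    minimal sketch of such a three-way split (the fractions here are
    arbitrary, not a recommendation):

        import random

        def assign_sets(hkls, free_frac=0.05, sleep_frac=0.05, seed=0):
            """Randomly label each reflection as work, free, or sleep."""
            rng = random.Random(seed)
            labels = {}
            for hkl in hkls:
                r = rng.random()
                if r < sleep_frac:
                    labels[hkl] = "sleep"
                elif r < sleep_frac + free_frac:
                    labels[hkl] = "free"
                else:
                    labels[hkl] = "work"
            return labels

    The sleep set is only ever looked at once, at the very end.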

    So, unless you have an Rsleep set, any time you contemplate doing
    a bunch of random things and picking the best Rfree ... don't. 
    Just don't.  There madness lies.

    What happens after doing this is that you will initially be happy
    about your lower Rfree, but everything you do after that will make
    it go up more than it would have had you not performed your Rfree
    optimization. This is because the changes in the data that made
    Rfree randomly better were actually noise, and as the structure
    becomes more correct it will move away from that noise. It's
    always better to optimize on something else, and then check your
    Rfree as infrequently as possible. Remember that it is the control
    for your experiment. Never mix your positive control with your
    sample.
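
    If you want to convince yourself of this, a toy numerical
    experiment makes the point. Below, every "protocol" is equally
    good and the observed Rfree differs only by noise, so the winner
    you pick by Rfree is biased low, and an untouched set (the Rsleep
    idea) would not share in that luck. All the numbers are invented:

        import random

        random.seed(1)
        true_rfree, noise, trials = 0.230, 0.003, 20

        # Twenty equally good "protocols"; Rfree differs only by noise.
        observed = [random.gauss(true_rfree, noise) for _ in range(trials)]

        print(f"best observed Rfree: {min(observed):.4f}")  # biased low
        print(f"true quality:        {true_rfree:.4f}")     # what an untouched set sees on average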

    As for the best metric to assess data quality?  Well, what are you
    doing with the data? There are always compromises in data
    processing and reduction that favor one application over another.
    If this is an "I just want the structure" project, then score on
    the resolution where CC1/2 hits your favorite value. For some that
    is 0.5, for others 0.3. I tend to use 0.0 so I can cut the data
    later without re-processing. Whatever you do, just be consistent.
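
    By "score on the resolution where CC1/2 hits your favorite value"
    I mean nothing fancier than reading the per-shell table, e.g. (the
    shell values below are invented):

        # (d-spacing in Angstrom, CC1/2) per shell, highest resolution last.
        shells = [(3.0, 0.99), (2.5, 0.97), (2.2, 0.90), (2.0, 0.70),
                  (1.9, 0.45), (1.8, 0.25), (1.7, 0.05)]

        def cutoff_resolution(shells, cc_threshold=0.3):
            """d-spacing of the highest-resolution shell with CC1/2 >= threshold."""
            best = None
            for d, cc_half in shells:
                if cc_half >= cc_threshold:
                    best = d
            return best

        print(cutoff_resolution(shells, 0.3))   # -> 1.9

    Score each processing protocol by that number, and keep the
    threshold the same across all of them.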

    If it's for anomalous, score on CCanom or, if that's too noisy, on
    Imean/sigma in the lowest-angle resolution or highest-intensity
    bin. This is because for anomalous you want to minimize relative
    error. The end-all-be-all of anomalous signal strength is the
    phased anomalous difference Fourier. You need phases to do one,
    but if you have a structure, just omit an anomalous scatterer of
    interest, refine to convergence, and then measure the peak height
    at the position of the omitted anomalous atom.  Instructions for
    doing anomalous refinement in refmac5 are here:
    
https://www2.mrc-lmb.cam.ac.uk/groups/murshudov/content/refmac/refmac_keywords.html
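
    Measuring the peak height itself can be done with whatever map
    tool you like; one possible sketch uses gemmi (the map file name
    and the site coordinates are placeholders, and the setup() call
    may differ between gemmi versions):

        import gemmi
        import numpy as np

        # Phased anomalous difference map written after the omit refinement
        # (hypothetical file name).
        ccp4 = gemmi.read_ccp4_map("anom_diff_omit.ccp4")
        ccp4.setup(float("nan"))   # expand the grid to the full unit cell
        grid = ccp4.grid

        # Orthogonal coordinates (Angstrom) of the omitted anomalous scatterer.
        site = gemmi.Position(12.3, 45.6, 7.8)
        peak = grid.interpolate_value(site)

        values = np.array(grid, copy=False)
        print(f"peak height: {peak / np.nanstd(values):.1f} sigma")

    The number to compare between processing protocols is that peak
    height in sigma units.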

    If you're looking for a ligand you probably want isomorphism, and
    in that case refining against a reference structure and looking
    for a low Rwork is not a bad strategy. This will tend to select
    for crystals containing a molecule that looks like the one you are
    refining. But be careful! If the reference is an apo structure,
    your ligand-bound crystals will have a higher Rwork due to the
    very difference density you are looking for.

    But if it's the same data just being processed in different ways,
    first make a choice about what you are interested in, and then
    optimize on that.  Just don't optimize on Rfree!

    -James Holton
    MAD Scientist


    On 10/27/2021 8:44 AM, Murpholino Peligro wrote:
    Let's say I ran autoproc with different combinations of options
    for a specific dataset, producing dozens of different (but not so
    different) mtz files...
    Then I ran phenix.refine with the same options for the same
    structure, but with all of my mtz zoo.
    What would be the best metric to say "hey, this combo works the
    best!"?
    R-free?
    Thanks

    M. Peligro
