Hi James,

What you wrote makes lots of sense. I had not heard about Rsleep, so that
looks like interesting reading, thanks.

I have often used Rfree as a simple tool to compare two protocols. If I am
not actually optimising against Rfree but just using it for a one-off
comparison then that is okay, right?

Let's say I have two data processing protocols, A and B. Between these I
might be exploring some difference in options within one data processing
program, perhaps different geometry refinement parameters, or scaling
options. I expect the A and B data sets to be quite similar, but I would
like to evaluate which protocol was "better", and I want to do this
quickly, ideally looking at a single number. I don't like I/sigI because I
don't trust the sigmas, CC1/2 is often noisy, and I'm totally sworn off
merging R statistics for these purposes. I tend to use Rfree as an
easily-available metric, independent from the data processing program and
the merging stats. It also allows a comparison of A and B in terms of the
"product" of crystallography, namely the refined structure. In this I am
lucky because I'm not trying to solve a structure. I may be looking at
lysozyme or proteinase K: something where I can download a pretty good
approximation to the truth from the PDB.

So, what I do is process the data by A and process by B, ensure the data
sets have the same free set, then refine to convergence (or at least, a lot
of cycles) starting from a PDB structure. I then evaluate A vs B in terms
of Rfree, though without an error bar on Rfree I don't read too much into
small differences.
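
One way to put a rough error bar on the A vs B difference is a paired
bootstrap over the shared free-set reflections. The sketch below is only an
illustration under stated assumptions: the function names are hypothetical,
each input is a list of (Fobs, Fcalc) amplitude pairs for the same free
reflections in the same order, and Fcalc is assumed to be already scaled to
Fobs. It captures only the sampling noise of the free set, not run-to-run
variation in refinement.

```python
import random


def r_factor(f_obs, f_calc):
    """Conventional R = sum(| |Fo| - |Fc| |) / sum(|Fo|)."""
    num = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    den = sum(abs(o) for o in f_obs)
    return num / den


def bootstrap_rfree_difference(free_a, free_b, n_boot=1000, seed=0):
    """Paired bootstrap of Rfree(A) - Rfree(B).

    free_a, free_b: lists of (f_obs, f_calc) pairs for the SAME free
    reflections, in the same order (an assumption of this sketch).
    Returns (delta, lo, hi), where lo..hi is an approximate 95%
    percentile interval on delta.
    """
    if len(free_a) != len(free_b):
        raise ValueError("A and B must share the same free set")
    rng = random.Random(seed)
    n = len(free_a)

    def rfree(pairs, idx):
        return r_factor([pairs[i][0] for i in idx],
                        [pairs[i][1] for i in idx])

    delta = rfree(free_a, range(n)) - rfree(free_b, range(n))

    deltas = []
    for _ in range(n_boot):
        # Resample reflection indices with replacement; use the same
        # indices for A and B so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(rfree(free_a, idx) - rfree(free_b, idx))
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    return delta, lo, hi
```

If the interval comfortably includes zero, I would not read anything into
the difference between A and B.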

Does this procedure seem sound? Perhaps it could be improved by randomly
jiggling the atoms in the starting structure, in case the PDB deposition
had already followed an A- or B-like protocol. Perhaps the whole approach
is suspect. Certainly I wouldn't want to generalise by saying that A or B
is better in all cases, but I do want to find a way to assess the various
tweaks I can try in data processing for a single case.
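
To make the "jiggling" idea concrete, here is a minimal sketch that adds
Gaussian noise to the coordinates of ATOM/HETATM records before refinement.
The function name and the default shift of 0.2 Å are my own illustrative
choices, not a recommendation; the only thing taken as given is the standard
PDB fixed-column layout (x, y, z in columns 31-54, each 8 characters wide).

```python
import random


def jiggle_pdb_lines(lines, sigma=0.2, seed=0):
    """Return copies of PDB lines with random shifts applied to x, y, z.

    sigma is the standard deviation (in Angstrom) of the Gaussian shift
    applied independently to each coordinate. Non-atom records are
    passed through unchanged.
    """
    rng = random.Random(seed)
    out = []
    for line in lines:
        if line.startswith(("ATOM  ", "HETATM")) and len(line) >= 54:
            # PDB columns 31-38, 39-46, 47-54 hold x, y, z (0-based 30:54).
            xyz = [float(line[30:38]), float(line[38:46]), float(line[46:54])]
            shifted = [v + rng.gauss(0.0, sigma) for v in xyz]
            line = (line[:30]
                    + "".join(f"{v:8.3f}" for v in shifted)
                    + line[54:])
        out.append(line)
    return out
```

Running this with a few different seeds and refining each perturbed model
against both A and B would also give a feel for how much Rfree scatters
from the starting point alone.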

Any thoughts? I appreciate the wisdom of the BB here.

Cheers

-- David


On Fri, 29 Oct 2021 at 15:50, James Holton <jmhol...@lbl.gov> wrote:

>
> Well, of all the possible metrics you could use to assess data quality
> Rfree is probably the worst one.  This is because it is a cross-validation
> metric, and cross-validations don't work if you use them as an optimization
> target. You can try, and might even make a little headway, but then your
> free set is burnt. If you have a third set of observations, as suggested
> for Rsleep (doi:10.1107/S0907444907033458), then you have a chance at
> another round of cross-validation. Crystallographers don't usually do this,
> but it has become standard practice in machine learning (training=Rwork,
> validation=Rfree and testing=Rsleep).
>
> So, unless you have an Rsleep set, any time you contemplate doing a bunch
> of random things and picking the best Rfree ... don't.  Just don't.  There
> madness lies.
>
> What happens after doing this is you will be initially happy about your
> lower Rfree, but everything you do after that will make it go up more than
> it would have had you not performed your Rfree optimization. This is
> because the changes in the data that made Rfree randomly better were
> actually noise, and as the structure becomes more correct it will move away
> from that noise. It's always better to optimize on something else, and then
> check your Rfree as infrequently as possible. Remember it is the control
> for your experiment. Never mix your positive control with your sample.
>
> As for the best metric to assess data quality?  Well, what are you doing
> with the data? There are always compromises in data processing and
> reduction that favor one application over another.  If this is a "I just
> want the structure" project, then score on the resolution where CC1/2 hits
> your favorite value. For some that is 0.5, others 0.3. I tend to use 0.0 so
> I can cut it later without re-processing.  Whatever you do just make it
> consistent.
>
> If it's for anomalous, score on CCanom or, if that's too noisy, the
> Imean/sigma in the lowest-angle resolution or highest-intensity bin. This
> is because for anomalous you want to minimize relative error. The
> end-all-be-all of anomalous signal strength is the phased anomalous
> difference Fourier. You need phases to do one, but if you have a structure
> just omit an anomalous scatterer of interest, refine to convergence, and
> then measure the peak height at the position of the omitted anomalous
> atom.  Instructions for doing anomalous refinement in refmac5 are here:
>
> https://www2.mrc-lmb.cam.ac.uk/groups/murshudov/content/refmac/refmac_keywords.html
>
> If you're looking for a ligand you probably want isomorphism, and in that
> case refining with a reference structure looking for low Rwork is not a bad
> strategy. This will tend to select for crystals containing a molecule that
> looks like the one you are refining.  But be careful! If it is an apo
> structure your ligand-bound crystals will have higher Rwork due to the very
> difference density you are looking for.
>
> But if it's the same data just being processed in different ways, first
> make a choice about what you are interested in, and then optimize on that.
> Just don't optimize on Rfree!
>
> -James Holton
> MAD Scientist
>
>
> On 10/27/2021 8:44 AM, Murpholino Peligro wrote:
>
> Let's say I ran autoproc with different combinations of options for a
> specific dataset, producing dozens of different (but not so different) mtz
> files...
> Then I ran phenix.refine with the same options for the same structure but
> with all my mtz zoo
> What would be the best metric to say "hey this combo works the best!"?
> R-free?
> Thanks
>
> M. Peligro
>
> ------------------------------
>
> To unsubscribe from the CCP4BB list, click the following link:
> https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list 
hosted by www.jiscmail.ac.uk, terms & conditions are available at 
https://www.jiscmail.ac.uk/policyandsecurity/
