Re: [ccp4bb] Resolution, R factors and data quality
On 1 September 2013 11:31, Frank von Delft <frank.vonde...@sgc.ox.ac.uk> wrote:

> 2. I'm struck by how small the improvements in R/Rfree are in Diederichs & Karplus (Acta D 2013, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3689524/); the authors don't discuss it, but what's current thinking on how to estimate the expected variation in R/Rfree - does the Tickle formalism (1998) still apply for ML with very weak data?

Frank, another point just occurred to me: the main reason for using Rfree as a model selection criterion is to detect overfitting in cases where you're comparing models with different numbers of parameters. That doesn't apply here, since you're comparing the same model. In that case you would be much better off comparing Rwork, since it has a much lower variance than Rfree (in fact lower by a factor of 19 if you use the usual 5% of reflections for the test set).

Cheers

-- Ian
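Where Ian's factor of 19 comes from: an R factor is a ratio of sums over N reflections, so its sampling variance scales roughly as 1/N, and the work set is 19 times larger than a 5% test set. A minimal sketch, assuming the conventional 95/5 split:

```python
# Sampling variance of an R factor scales roughly as 1/N (it is a ratio
# of sums over N reflections), so Var(Rfree)/Var(Rwork) ~ N_work/N_free.
frac_test = 0.05                       # the usual 5% test set
var_ratio = (1 - frac_test) / frac_test
print(var_ratio)                       # 19.0, Ian's factor
```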
Re: [ccp4bb] Resolution, R factors and data quality
Hi Frank and Ian,

We struggled with the small changes in free R-factors when we implemented the paired refinement for resolution cut-offs in PDB_REDO. It's not just the lack of a proper significance test for (weighted) R-factor changes, it's also a more philosophical problem. When should you reject a higher resolution cut-off?

a) When it gives significantly higher R-factors (lenient)
b) When it gives numerically higher R-factors (less lenient, but takes away the need for a significance test)
c) When it does not give significantly lower R-factors (very strict; if I take X*sigma(R-free) as a cut-off, with X >= 1.0, in most cases I should reject the higher cut-off)

PDB_REDO uses b), similar to Karplus and Diederichs in their Science paper.

Then the next question is which metric you are going to use. R-free, weighted R-free, free log likelihood and CCfree are all written out by Refmac. At least the latter two have proper significance tests (likelihood ratios and transformation Z-scores, respectively). Note that we use different models, constructed with different (but very much overlapping) data, but the metrics are calculated with the same data. The different metrics do not necessarily move in the same direction when moving to a higher resolution. We ended up using all four in PDB_REDO. By default a higher resolution cut-off is rejected if more than one metric gets (numerically) worse, but this can be changed by the user.

The next question is the size of the resolution steps. How big should they be and how should they be set up? Karplus and Diederichs used equal steps in Angstrom; PDB_REDO uses equal steps in number of reflections. That way you add the same amount of data (but not of usable information) with each step. Anyway, a different choice of steps will give a different final resolution cut-off, and the exact cut-off doesn't matter that much (see Evans and Murshudov). Different (versions of) refinement programs will probably also give somewhat different results.

We tested our implementation on a number of structures in the PDB with data extending to higher resolution than marked in the PDB file, and we observed that quite a lot had very conservative resolution cut-offs. In some cases we could use so much extra data that we could move to a more complex B-factor model and seriously improve R-factors.

The best resolution cut-off is unclear and may change over time with improving methods. So whatever you choose, please deposit all the data that you can get, even if you don't use it yourself. I think the Karplus and Diederichs papers show that you should at least realize that your resolution cut-off is a methodological choice that you should describe and be able to defend if somebody asks why you made that particular choice.

Cheers,
Robbie
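The rejection rule Robbie describes is easy to state in code. A minimal sketch, assuming the four metrics reported by Refmac and a placeholder interface (the names below are illustrative, not PDB_REDO's actual code); as in paired refinement, both sets of metrics are evaluated against the same lower-resolution test reflections:

```python
# Sketch of the PDB_REDO-style rule: reject the higher cut-off if more
# than `max_worse` of the four metrics gets numerically worse.
# Higher is better for the free log-likelihood and CCfree; lower is
# better for the (weighted) free R-factors.
HIGHER_IS_BETTER = {"r_free": False, "weighted_r_free": False,
                    "ll_free": True, "cc_free": True}

def accept_higher_cutoff(metrics_low, metrics_high, max_worse=1):
    """metrics_* map metric name -> value; both sets are computed on the
    SAME (lower-resolution) reflections, as paired refinement requires."""
    worse = 0
    for name, high_is_good in HIGHER_IS_BETTER.items():
        if high_is_good:
            worse += metrics_high[name] < metrics_low[name]
        else:
            worse += metrics_high[name] > metrics_low[name]
    return worse <= max_worse

# Made-up example: two metrics improve, CCfree dips slightly -> accepted.
low  = {"r_free": 0.2210, "weighted_r_free": 0.2105,
        "ll_free": -51230.0, "cc_free": 0.941}
high = {"r_free": 0.2196, "weighted_r_free": 0.2101,
        "ll_free": -51210.0, "cc_free": 0.940}
print(accept_higher_cutoff(low, high))   # True: only one metric got worse
```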
Re: [ccp4bb] Resolution, R factors and data quality
A bit late to this thread.

1. Juergen: Jim was not actually adopting CC*, he was asking how to make practical use of it when faced with actual datasets fading into noise. If I understand correctly from later responses, paired refinement is what K&D suggest should be best practice?

2. I'm struck by how small the improvements in R/Rfree are in Diederichs & Karplus (Acta D 2013, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3689524/); the authors don't discuss it, but what's current thinking on how to estimate the expected variation in R/Rfree - does the Tickle formalism (1998) still apply for ML with very weak data?

I'm also puzzled by Table 4 (and discussion): do I read correctly that discarding negative unique reflections led to higher CCwork/CCfree? Wasn't the point of the paper that massaging data always shows up in worse refinement stats? Is this a corner case, and how would one know?

Cheers
phx
Re: [ccp4bb] Resolution, R factors and data quality
On 1 September 2013 11:31, Frank von Delft <frank.vonde...@sgc.ox.ac.uk> wrote:

> 2. I'm struck by how small the improvements in R/Rfree are in Diederichs & Karplus (Acta D 2013, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3689524/); the authors don't discuss it, but what's current thinking on how to estimate the expected variation in R/Rfree - does the Tickle formalism (1998) still apply for ML with very weak data?

Frank, our paper is still relevant, unfortunately just not to the question you're trying to answer! We were trying to answer two questions: 1) what value of Rfree you would expect to get if the structure were free of systematic error and only random errors were present, so that it could be used as a baseline (assuming a fixed cross-validation test set) to identify models with gross (e.g. chain-tracing) errors; and 2) how much you would expect Rfree to vary assuming a fixed starting model but with a different random sampling of the test set (i.e. the sampling standard deviation). The latter is relevant if, say, you want to compare the same structure (at the same resolution, obviously) done independently in two labs, since it tells you how big the difference in Rfree for an arbitrary choice of test set needs to be before you can claim that it's statistically significant.

In this case the questions are different, because you're certainly not comparing different models using the same test set; neither, I suspect, are you comparing the same model with different randomly selected test sets. I assume in this case that the test sets for different resolution cut-offs are highly correlated, which I suspect makes it quite difficult to say what is a significant difference in Rfree (I have not attempted to do the algebra!).

Rfree is one of a number of model selection criteria (see http://en.wikipedia.org/wiki/Model_selection#Criteria_for_model_selection) whose purpose is to provide a metric for comparison of different models given specific data, i.e. as for the likelihood function they all take the form f(model | data), so in all cases you're varying the model with fixed data. Its use in the form f(data | model), i.e. where you're varying the data with a fixed model, is I would say somewhat questionable, and it certainly requires careful analysis to determine whether the results are statistically significant. Even assuming we can argue our way around the inappropriate application of model selection methodology to a different problem, Rfree is unfortunately far from an ideal criterion in this respect; a better one would surely be the free log-likelihood, as originally proposed by Gerard Bricogne.

Cheers

-- Ian
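Ian's question 2 (the sampling spread of Rfree over random choices of test set) is easy to get a feel for numerically. A minimal sketch, assuming synthetic amplitudes with 20% noise standing in for a refined structure and the usual 5% test-set fraction; only test-set sampling is modelled, not model differences between labs:

```python
# How much does Rfree move when only the random 5% test-set choice changes?
import numpy as np

rng = np.random.default_rng(1)
n_refl = 30000
fc = rng.gamma(2.0, 100.0, n_refl)               # fake |Fc|
fo = fc * (1 + rng.normal(0, 0.20, n_refl))      # fake |Fo|, 20% noise

def r_free_once():
    test = rng.choice(n_refl, size=n_refl // 20, replace=False)   # 5%
    return np.abs(fo[test] - fc[test]).sum() / fo[test].sum()

samples = np.array([r_free_once() for _ in range(1000)])
sigma = samples.std()
print(f"Rfree = {samples.mean():.4f} +/- {sigma:.4f}")
# For two labs with independent test sets, Rfree differences below roughly
# 2*sqrt(2)*sigma are indistinguishable from test-set sampling noise alone.
print(f"rough significance threshold ~ {2 * np.sqrt(2) * sigma:.4f}")
```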
Re: [ccp4bb] Resolution, R factors and data quality
Hi Bernhard,

<snip>

> But the real objective is – where do data stop making an improvement to the model. The categorical statement that all data is good is simply not true in practice. It is probably specific to each data set and refinement, and as long as we do not always run paired refinement a la KD or similar in order to find out where that point is, the yearning for a simple number will not stop (although I believe automation will make the KD approach or similar eventually routine).

For what it is worth: this is already implemented in PDB_REDO.

Cheers,
Robbie
Re: [ccp4bb] Resolution, R factors and data quality
> Based on the simulations I've done the data should be cut at CC1/2 = 0. Seriously. Problem is figuring out where it hits zero.

But the real objective is – where do data stop making an improvement to the model. The categorical statement that all data is good is simply not true in practice. It is probably specific to each data set and refinement, and as long as we do not always run paired refinement a la KD or similar in order to find out where that point is, the yearning for a simple number will not stop (although I believe automation will make the KD approach or similar eventually routine).

> As for the resolution of the structure I'd say call that where |Fo-Fc| (error in the map) becomes comparable to Sigma(Fo). This is I/Sigma = 2.5 if Rcryst is 20%. That is: |Fo-Fc| / Fo = 0.2, which implies |Io-Ic|/Io = 0.4 or Io/|Io-Ic| = Io/sigma(Io) = 2.5. Makes sense to me...
>
> -James Holton
> MAD Scientist

As long as it is understood that this 'model resolution value' derived via your argument from I/sigI is not the same as an I/sigI data cutoff (and that Rcryst and Rmerge have nothing in common)...

Best, BR
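"Figuring out where it hits zero" is a curve-fitting problem: per-shell CC1/2 values are noisy, so one can fit a smooth falloff and solve for the crossing. A minimal sketch, assuming a logistic-shaped decay and made-up shell data; the functional form and every number here are illustrative assumptions, not a recommended model:

```python
# Fit a smooth sigmoidal falloff to noisy per-shell CC1/2 and solve for
# the zero crossing.
import numpy as np
from scipy.optimize import brentq, curve_fit

# per-shell 1/d^2 midpoints and CC1/2 values (made-up example data)
inv_d2 = np.linspace(0.05, 1.3, 15)
cc_half = np.array([0.99, 0.99, 0.98, 0.97, 0.95, 0.92, 0.85, 0.75,
                    0.60, 0.42, 0.25, 0.12, 0.05, 0.01, -0.02])

def decay(x, x0, k):
    # logistic falloff from ~1 toward a small negative floor
    return 1.02 / (1 + np.exp(k * (x - x0))) - 0.02

popt, _ = curve_fit(decay, inv_d2, cc_half, p0=[0.8, 6.0])
x_zero = brentq(lambda x: decay(x, *popt), 0.05, 2.0)
print(f"CC1/2 crosses zero near d = {1 / np.sqrt(x_zero):.2f} A")
```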
Re: [ccp4bb] Resolution, R factors and data quality
We don't currently have a really good measure of the point where adding the extra shell of data adds significant information (whatever that means). However, my rough trials (see http://www.ncbi.nlm.nih.gov/pubmed/23793146) suggested that the exact cutoff point was not very critical, presumably because the information content fades out slowly, so it probably isn't something to agonise over too much. K&D's paired refinement may be useful though.

I would again caution against looking too hard at CC* rather than CC1/2: they are exactly equivalent, but CC* changes very rapidly at small values, which may be misleading. The purpose of CC* is for comparison with CCcryst (i.e. Fo to Fc).

I would remind any users of Scala who want to look back at old log files to see the statistics for the outer shell at the cutoff they used that CC1/2 has been calculated in Scala for many years, under the name CC_IMEAN. It's now called CC1/2 in Aimless (and Scala) following Kai's excellent suggestion.

Phil
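"Exactly equivalent" here is a closed-form relation: from the Karplus & Diederichs definition (Science 2012), CC* = sqrt(2*CC1/2 / (1 + CC1/2)). A few values make Phil's caution concrete:

```python
# CC* as a function of CC1/2 (Karplus & Diederichs, Science 2012).
import numpy as np

def cc_star(cc_half):
    return np.sqrt(2.0 * cc_half / (1.0 + cc_half))

for cc in (0.9, 0.5, 0.2, 0.1, 0.05, 0.01):
    print(f"CC1/2 = {cc:4.2f}  ->  CC* = {cc_star(cc):.3f}")
# Even CC1/2 = 0.01 maps to CC* ~ 0.14: small changes in CC1/2 near the
# cutoff produce large changes in CC*, which is why reading CC* directly
# can mislead.
```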
Re: [ccp4bb] Resolution, R factors and data quality
Hi all,

If I am not wrong, the Karplus & Diederichs paper suggests that data is generally meaningful up to a CC1/2 value of 0.20, but they suggest a paired refinement technique (pretty easy to perform) to actually decide on the resolution at which to cut the data. This will be the most prudent thing to do, I guess, rather than following any arbitrary value, as each data set is different. But the fact remains that even where I/sigma(I) falls to 0.5 useful information remains which will improve the quality of the maps, and when discarded just leads us a bit further away from the truth. However, as always, Drs Diederichs and Karplus will be the best persons to comment on that (as they have already done in the paper :) )

best,
Arka Chakraborty
ibmb (Institut de Biologia Molecular de Barcelona)
BARCELONA, SPAIN

p.s. Aimless seems to suggest a resolution limit based on a CC1/2 = 0.5 criterion (which I guess is done to be on the safe side - Dr. Phil Evans can explain if there are other or entirely different reasons for it!). But if we want to squeeze the most from our data set, I guess we need to push a bit further sometimes :)
Re: [ccp4bb] Resolution, R factors and data quality
Aimless does indeed calculate the point at which CC1/2 falls below 0.5, but I would not necessarily suggest that as the best cutoff point. Personally I would also look at I/sigI, anisotropy and completeness, but as I said, at that point I don't think it makes a huge difference.

Phil
Re: [ccp4bb] Resolution, R factors and data quality
> We don't currently have a really good measure of that point where adding the extra shell of data adds significant information ... so it probably isn't something to agonise over too much. K&D's paired refinement may be useful though.

That seems to be a correct assessment of the situation, and a forceful argument to eliminate the review nonsense of nitpicking on I/sigI values, associated R-merges, and other pseudo-statistics once and for all. We can now, thanks to data deposition, at any time generate or download the maps and the models and judge for ourselves even minute details of local model quality. As far as use and interpretation goes, where the model meets the map is where the rubber meets the road.

I therefore make the heretic statement that the entire Table 1 of data collection statistics, justifiable in pre-deposition times as some means to guess structure quality, can go the way of X-ray film and be almost always eliminated from papers. There is nothing really useful in Table 1, and all its data items and more are in the PDB header anyhow. Availability of maps for review and for users is the key point.

Cheers, BR
Re: [ccp4bb] Resolution, R factors and data quality
What a statement! Give reviewers maps, I agree; however, what if the reviewer has no clue of these things we call structures? I think for those people Table 1 might still provide some justification. I would argue it should go into the supplement at least.

Jürgen
Re: [ccp4bb] Resolution, R factors and data quality
Hi,

a random thought: the data resolution, d_min_actual, can be thought of as the one that maximizes the correlation (*) between the synthesis calculated using your data and an equivalent Fmodel synthesis calculated using the complete set of Miller indices in the d_min_actual-inf resolution range, where d_min <= d_min_actual and d_min is the highest resolution of the data set in question. Makes sense to me..

(*) or any other more appropriate similarity measure: the usual map CC may not be the best one in this context.

Pavel
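Pavel's thought experiment can be tried end to end in one dimension. A self-contained toy sketch: a "true" map, noisy Fourier data whose noise grows with frequency, a perfect model, and a sweep over trial cutoffs to find the one maximizing the map correlation. Everything here (signal, noise model, the plain CC score) is a synthetic assumption for illustration, not a crystallographic implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 512
x = np.linspace(0, 1, n, endpoint=False)
true_map = (np.exp(-((x - 0.3) / 0.01) ** 2)
            + np.exp(-((x - 0.7) / 0.02) ** 2))    # two "atoms"

f_true = np.fft.rfft(true_map)                     # "Fmodel", complete
k = np.arange(f_true.size)
sigma_k = (0.02 + 0.5 * (k / k.max()) ** 2) * 0.05 * np.abs(f_true).max()
noise = rng.normal(size=f_true.size) + 1j * rng.normal(size=f_true.size)
f_obs = f_true + noise * sigma_k                   # "data": noisier at high k

def cc(a, b):
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())

best_k, best_cc = 0, -1.0
for k_cut in range(10, f_true.size):               # sweep trial cutoffs
    keep = k <= k_cut
    m_obs = np.fft.irfft(np.where(keep, f_obs, 0), n)
    m_mod = np.fft.irfft(np.where(keep, f_true, 0), n)  # complete model map
    c = cc(m_obs, m_mod)
    if c > best_cc:
        best_k, best_cc = k_cut, c
print(f"optimal cutoff index {best_k} of {f_true.size - 1}, CC = {best_cc:.4f}")
```

The CC rises while added shells carry mostly signal and falls once they carry mostly noise, so the maximum marks the toy's "actual" resolution.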
Re: [ccp4bb] Resolution, R factors and data quality
> what if the reviewer has no clue of these things we call structures? I think for those people Table 1 might still provide some justification.

Someone who knows little about structures probably won't appreciate the technical details in Table 1 either.
Re: [ccp4bb] Resolution, R factors and data quality
Jim,

This is coming from someone who just got enlightened a few weeks ago on resolution cut-offs.

> I am asked often: What value of CC1/2 should I cut my resolution at?

The KD paper mentioned that the CC(1/2) criterion loses its significance at ~9 according to Student's t-test. I doubt that this can be a generally true guideline for a resolution cut-off. The structures I am doing right now were cut off at ~20 to ~80 CC(1/2). You probably do not want to make the same mistake again that we all made before when cutting resolution based on Rmerge/Rmeas, do you?

> What should I tell my students? I've got a course coming up and I am sure they will ask me again.

This is actually the more valuable insight I got from the KD paper. You don't use the CC(1/2) as an absolute indicator but rather as a suggestion. The resolution limit is determined by the refinement, not by the data processing. I think I will handle my data in the future as follows: bins with CC(1/2) less than 9 are initially excluded. The structure is then refined against all reflections in the file, and only those bins that add information to the map/structure are kept in the final rounds. In most cases this will probably be more than a CC(1/2) of 25. If the last shell (CC ~9) still adds information to the model, process the images again, e.g. till CC(1/2) drops to 0, and see if some more useful information is in there. You could also go ahead and use CC(1/2) = 0 as the initial cut-off, but I think that will rather increase computation time than help your structure in most cases.

So yes, I would feel comfortable with giving true resolution limits based on the refinement of the model, and not based on any number derived from data processing. In the end, you can always say "I tried it and this was the highest resolution I could model" vs. "I cut at _numerical value X of this parameter_ because everybody else does so".
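The "significance" referred to here is the standard Student's t-test for a correlation coefficient, which is how the KD paper assesses whether a shell's CC1/2 is distinguishable from zero. A minimal sketch, with made-up shell sizes; the alpha level is an arbitrary assumption:

```python
# t = r * sqrt((n-2) / (1-r^2)) with n-2 degrees of freedom, where n is
# the number of half-dataset pairs in the shell.
from scipy import stats

def cc_half_significant(cc, n_pairs, alpha=0.001):
    t = cc * ((n_pairs - 2) / (1.0 - cc * cc)) ** 0.5
    return stats.t.sf(t, df=n_pairs - 2) < alpha    # one-sided p < alpha

print(cc_half_significant(0.09, 2500))   # True: CC1/2 ~ 9% in a big shell
print(cc_half_significant(0.09, 200))    # False: same CC1/2, small shell
```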
Re: [ccp4bb] Resolution, R factors and data quality
Excellent point about R-factors. Indeed, at this resolution they should be quite a bit lower than what you have. Did you:

- model solvent?
- use anisotropic ADPs?
- add H (this alone can drop R by 1-2%)?
- model alternative conformations?
- look at how the R-factors (Rwork) behave as a function of resolution?

Pavel
Re: [ccp4bb] Resolution, R factors and data quality
Maybe a few remarks might help:

Ad a) "R merge of 80% may be OK if I/sig for high res shell is > 2": what rationale is that statement based upon, and what is its exact meaning? Is an Rmerge of 80% not OK when I/sigI is, say, 1.5? Or would 80% be OK if the I/sigI is 3.0? Why should an Rmerge of 80% be (too) high in the first place?

b) There is no statistical justification whatsoever for the I/sigI cutoff of 2 for refinement. This has been discussed @CCP4bb multiple times, for good reason. In this particular case, the (in)completeness appears to be the dominating factor.

c) As Pavel notes, the R-value improvement means nil when truncating data - try to refine from 8 to 2 A and the Rs might be even lower (an abuse we engaged in ages ago, when we did not know better and had no ML).

d) Absolute values of refinement Rs vs (historic) expectation values cannot be judged without complete and detailed knowledge of the refinement protocol.

The ultimate question is whether your model improves with inclusion of more data or not. Kay Diederichs has a few papers to this effect that make good reading. And CC1/2 seems to provide statistically justifiable limits for cut-off of (reasonably complete) high resolution shells.

Best regards, BR
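Point c) is easy to demonstrate: truncating noisy high-resolution data lowers the overall R without the model getting any better, which is why R values computed on different reflection sets are not comparable. A minimal sketch with synthetic amplitudes whose noise grows toward high resolution; all numbers are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50000
s = rng.uniform(0.0, 1.0, n)                        # (1/d^2)-like coordinate
fc = rng.gamma(2.0, 100.0, n) * np.exp(-2 * s)      # falling, Wilson-like
fo = fc * (1 + rng.normal(0, 0.05 + 0.5 * s, n))    # noisier at high res

def r_factor(sel):
    return np.abs(fo[sel] - fc[sel]).sum() / fo[sel].sum()

print("R, all data:      %.3f" % r_factor(np.ones(n, bool)))
print("R, truncated set: %.3f" % r_factor(s < 0.5))  # same model, lower R
```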
Re: [ccp4bb] Resolution, R factors and data quality
The question you should ask yourself is why would omitting data improve my model?

Phil
Re: [ccp4bb] Resolution, R factors and data quality
Hi all,

Does this not again bring up the still prevailing adherence to R factors, and not a shift to correlation coefficients (CC1/2 and CC*), as Dr. Phil Evans has indicated? The way we look at data quality (by "we" I mean the end users) needs to be altered, I guess.

best,
Arka Chakraborty
ibmb (Institut de Biologia Molecular de Barcelona)
BARCELONA, SPAIN
Re: [ccp4bb] Resolution, R factors and data quality
I have to ask flamingly: So what about CC1/2 and CC*? Did we not replace an arbitrary resolution cut-off based on a value of Rmerge with an arbitrary resolution cut-off based on a value of Rmeas already? And now we are going to replace that with an arbitrary resolution cut-off based on a value of CC* or is it CC1/2?

I am asked often: What value of CC1/2 should I cut my resolution at? What should I tell my students? I've got a course coming up and I am sure they will ask me again.

Jim
Re: [ccp4bb] Resolution, R factors and data quality
Hi Jim,

all data is good data - the more data you have the better (that's what they say anyhow). Not everybody is adopting the Karplus & Diederichs paper as quickly as you do. And not to be confused with the Diederichs and Karplus paper :-)

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3689524/
http://www.ncbi.nlm.nih.gov/pubmed/22628654

My models get better by including the data I had been omitting before; that's all that counts for me.

Jürgen

P.S. reminds me somehow of those guys collecting more and more data - PRISM greetings

..
Jürgen Bosch
Johns Hopkins University
Bloomberg School of Public Health
Department of Biochemistry & Molecular Biology
Johns Hopkins Malaria Research Institute
615 North Wolfe Street, W8708
Baltimore, MD 21205
Office: +1-410-614-4742
Lab: +1-410-614-4894
Fax: +1-410-955-2926
http://lupo.jhsph.edu
Re: [ccp4bb] Resolution, R factors and data quality
Based on the simulations I've done, the data should be cut at CC1/2 = 0. Seriously. Problem is figuring out where it hits zero. Alternately, if French & Wilson can be modified so the Wilson plot is always straight, then the data don't need to be cut at all.

As for the resolution of the structure, I'd say call that where |Fo-Fc| (the error in the map) becomes comparable to Sigma(Fo). This is I/Sigma = 2.5 if Rcryst is 20%. That is: |Fo-Fc| / Fo = 0.2, which implies |Io-Ic|/Io = 0.4, or Io/|Io-Ic| = Io/sigma(Io) = 2.5. Makes sense to me...

-James Holton
MAD Scientist
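James's arithmetic rests on simple error propagation through I = F^2: a small relative error in F doubles in I. Worked through, with Rcryst ~ 0.20 standing in for the relative error in F (the step Bernhard flags elsewhere in the thread as an assumption):

```python
rel_err_F = 0.20               # |Fo - Fc| / Fo, taken as ~ Rcryst
rel_err_I = 2 * rel_err_F      # dI/I = 2 dF/F, since I = F^2
i_over_sigma = 1 / rel_err_I   # point where sigma(Io) ~ |Io - Ic|
print(rel_err_I, i_over_sigma) # 0.4 2.5
```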
[ccp4bb] Resolution, R factors and data quality
Hi All,

I have collected diffraction images to 1 Angstrom resolution at the edge of the detector and 0.9 A at the corner. I collected two sets, one for low resolution reflections and one for high resolution reflections. I get 100% completeness above 1 A and 41% completeness in the 0.9-0.95 A shell. However, my Rmerge in the highest shell is not good, ~80%. The Rfree is 0.17 and Rwork is 0.16, but the maps look very good.

If I cut the data to 1 Angstrom the R factors improve, but I feel the maps are not as good and I'm not sure I can justify cutting data. So my question is, should I cut the data to 1 Angstrom or should I keep the data I have?

Also, taking geometric restraints off during refinement the R factors improve marginally; am I justified in doing this at this resolution?

Thank you,
Emily
Re: [ccp4bb] Resolution, R factors and data quality
Hi Emily,

> I get 100% completeness above 1 A and 41% completeness in the 0.9-0.95 A shell. However, my Rmerge in the highest shell is not good, ~80%. The Rfree is 0.17 and Rwork is 0.16 but the maps look very good. If I cut the data to 1 Angstrom the R factors improve but I feel the maps are not as good and I'm not sure if I can justify cutting data.

You can't compare R-factors calculated using different sets of reflections.

Maps get worse? Could it be that when you use the full resolution range you get 59% of the missing reflections in the highest resolution shell filled in with DFc for the purpose of map calculation?

> Also, taking geometric restraints off during refinement the Rfactors improve marginally, am I justified in doing this at this resolution?

It's unlikely you can refine without restraints at this resolution. Perhaps without restraints the model is still OK overall, but I would bet there are places that get badly distorted, so have a closer look at your model quality locally (alternative conformations, mobile loops, etc.).

Pavel
Re: [ccp4bb] Resolution, R factors and data quality
Thanks Yuriy and Pavel,

> at this resolution one would expect R/Rfree to be ~ 10-11%/12-13% assuming you applied anisotropic B-factor refinement (and probably having a low symmetry SG). Rmerge of 80% may be OK if I/sig for the high res shell is > 2.

Yes, I used anisotropic B-factors and the space group is P1 21 1. However, the I/sig is only 1.5 in the highest shell. Cutting the data such that the I/sig is 2 has improved the R factors. Thank you.

> Maps get worse? Could it be when you use all resolution range you get 59% of missing reflections in highest resolution shell filled in with DFc for the purpose of map calculation?

Yes! The map that I was looking at was filled.

Emily