Hi all, Here is a reference some people in our lab have been using to extract errors on NOE calculated from volumes.
Nicholson, Kay, Baldisseri, Arango, Young, Bax, and Torchia (1992) Biochemistry, 31: 5253-5263. And here is the interesting part... ====================== To estimate the error in NOE values, the standard deviation in baseline noise (a) was determined for each spectrum and expanded into a standard deviation in volume given by Delta V = sigma sqrt(nk) where n is the number of points in the f1 dimension and k is the number of points in the f2 dimension of a typical peak at the baseline. The error associated with the NOE value for peak j is then given by error_j = (V_A_j / V_B_j) * sqrt[(Delta_V_A / V_A_j)^2 + (Delta_V_B / V_B_j)^2] where V_A and V_B denote the volume of peak j in the presence and absense of NOE enhancement, respectively, and Delta_V_A and Delta_V_B denote the standard deviations in volume for the spectra recorded in the presence and absense of NOE enhancement, respectively. ====================== I'm not an expert in statistics, so won't judge the value of this approach. However, this is an approach some people have been using... Ciao ! Séb Edward d'Auvergne wrote: > Oh, I forgot about the std error formula. Is where the sqrt(2) comes > from? Doh, that would be retarded. Then I know someone who would > require sqrt(3) for the NOE spectra! Is that really what Palmer > meant, that std error is the same as "the standard deviation of the > differences between the heights of corresponding peaks in the paired > spectra" which "is equal to sqrt(2)*sigma" (Palmer et al., 1991)? > > I'm pretty sure though that the standard error is not the measure we > want for the confidence interval of the peak intensity. The reason is > because I think that the std error is a measure of how far the sample > mean is from the true mean (ignore this, this is a quick reference for > myself: http://en.wikipedia.org/wiki/Standard_error_(statistics) ). > (Warning, from here to the end of the paragraph is a rant!) This is > similar in concept to AIC model selection (see > http://en.wikipedia.org/wiki/Akaike_information_criterion and > http://en.wikipedia.org/wiki/Model_selection if you haven't heard > about the advanced statistical field of model selection before). AIC > is a little more advanced though as it estimates the Kullback-Leibler > discrepancy (http://en.wikipedia.org/wiki/Kullback–Leibler_divergence) > which is a measure of distance between the true distribution and the > back-calculated distribution using all information about the > distribution. Ok, that wasn't too relevant. Anyway, the std error as > a measure of the differences in means of 2 different distributions is > not a measure of the spread of either the true, measured, or > back-calculated distributions (or the 4th distribution, the > back-calculated from the fit to the true model). The std error is not > the confidence intervals of any of these 4 distributions, just the > difference between 2 of them using only a small part of the > information of those distributions. It's the statistical measure of > the difference in means of the true and measured distributions. As an > aside, for those completely lost now a clearer explanation of these 4 > distributions fundamental to data analysis, likelihood, discrepancies, > etc. can be read in section 2.2 of my PhD thesis at > http://dtl.unimelb.edu.au:80/R/-?func=dbin-jump-full&object_id=67077&current_base=GEN01 > or > http://www.amazon.com/Protein-Dynamics-Model-free-Analysis-Relaxation/dp/3639057627/ref=sr_1_6?ie=UTF8&s=books&qid=1219247007&sr=8-6 > (sorry for the blatant plug ;). Oh, the best way to picture all of > these concepts and the links between them is to draw and label 4 > distributions on a piece of paper on the same x and y-axes with not > too much overlap between them and connect them with arrows labelled > with all the weird terminology. > > Sorry, that was just a long way of saying that the std error is the > quality of how the sample mean matches the real mean and how the > standard deviation is the spread of the distribution. That being so, > I would avoid setting the standard error as the peak height > uncertainty. Maybe it would be best to do as you say Chris, and also > avoid the averaging of the replicated intensities. > > So, now to the dirty statistics required for the sparse sampling > caused by lack of NMR time. We can use the standard deviation formula > - the measure of the spread of the measured distribution which is the > estimator of the spread of the true distribution that optimisation > (which is chi-squared fitting and hence finding the maximal likelihood > (http://en.wikipedia.org/wiki/Maximum_likelihood)), Monte Carlo > simulations, and the frequentist model selection techniques all > utilise as the foundation for their fundamental derivations. We would > then square this to get the variance: > > sigma^2 = sum({Ii - Iav}^2) / (n - 1). > > In my previous posts I was saying sigma was the variance, but just > ignore that. Next we need to average the variance across all spins, > simply because the sample size is so low for each peak and hence the > error estimate is horrible. Whether this estimator of the true > variance is good or bad is debatable (well, actually, it's bad), but > it is unavoidable. It also has the obvious disadvantage in that the > peak height error is, in reality, different for each peak. > > Now, if not all spectra are replicated, then the approach needs to be > modified to give us errors for the peaks of the single spectra. Each > averaged replicated time point (spectrum) has a single error > associated with it, the average variance. These are usually different > for different time points, in some cases weakly decreasing > exponentially. So I think we should average the average variances and > have a single measure of spread for all peaks in all spectra. This > estimator of the variance is again bad. Interpolation might be > better, but is still not great. > > Cheers, > > Edward > > > P.S. None of this affects an analysis using peak heights of > non-replicated spectra and the RMSD of the baseplane noise. But if > someone wants to use the peak volume as a measure of peak intensity, > then we have a statistical problem. Until someone finds a reference > or derives the formula for how the RMSD of the base plane noise > relates to volume error, then peak heights will be essential for error > analysis. I'm also willing to be corrected on any of the statistics > above as I'm not an expert in this and may have missed some > fundamental connections between theories. And there may be a less > 'dirty' way of doing the dirty part of the statistics. > > > > On Fri, Oct 17, 2008 at 1:59 AM, Chris MacRaild <[EMAIL PROTECTED]> wrote: > >> On Thu, Oct 16, 2008 at 8:07 PM, Edward d'Auvergne >> <[EMAIL PROTECTED]> wrote: >> >>> On Thu, Oct 16, 2008 at 7:02 AM, Chris MacRaild <[EMAIL PROTECTED]> wrote: >>> >>>> On Thu, Oct 16, 2008 at 3:11 PM, Sébastien Morin >>>> <[EMAIL PROTECTED]> wrote: >>>> >>>>> Hi, >>>>> >>>>> I have a general question about curve fitting within relax. >>>>> >>>>> Let's say I proceed to curve fitting for some relaxation rates >>>>> (exponential decay) and that I have a duplicate delay for error >>>>> estimation. >>>>> >>>>> ======== >>>>> delays >>>>> >>>>> 0.01 >>>>> 0.01 >>>>> 0.02 >>>>> 0.04 >>>>> ... >>>>> ======== >>>>> >>>>> Will the mean value (for delay 0.01) be used for curve fitting and rate >>>>> extraction ? >>>>> Or will both values at delay 0.01 be used during curve fitting, hence >>>>> giving more weight on delay 0.01 ? >>>>> >>>>> In other words, will the fit only use both values at delay 0.01 for >>>>> error estimation or also for rate extraction, giving more weight for >>>>> this duplicate point ? >>>>> >>>>> How is this handled in relax ? >>>>> >>>>> Instinctively, I would guess that the man value must be used for >>>>> fitting, as we don't want the points that are not in duplicate to count >>>>> less in the fitting procedure... Am I right ? >>>>> >>>>> >>>> I would argue not. If we have gone to the trouble of measuring >>>> something twice (or, equivalently, measuring it with greater >>>> precision) then we should weight it more strongly to reflect that. >>>> >>>> So we should include both duplicate points in our fit, or we should >>>> just use the mean value, but weight it to reflect the greater >>>> certainty we have in its value. >>>> >>>> As I type this I realise this is likely the source of the sqrt(2) >>>> factor Tyler and Edward have been debating on a parallel thread - the >>>> uncertainty in height of any one peak is equal to the RMS noise, but >>>> the std error of the mean of duplicates is less by a factor of >>>> sqrt(2). >>>> >>> At the moment, relax simply uses the mean value in the fit. Despite >>> the higher quality of the duplicated data, all points are given the >>> same weight. This is only because of the low data quantity. As for >>> dividing the sd of differences between duplicate spectra by sqrt(2), >>> this is not done in relax anymore. Because some people have collected >>> triplicate spectra, although rare, relax calculates the error from >>> replicated spectra differently. I'm prepared to be told that this >>> technique is incorrect though. The procedure relax uses is to apply >>> the formula: >>> >>> sd^2 = sum({Ii - Iav}^2) / (n - 1), >>> >>> where n is the number of spectra, Ii is the intensity in spectrum i, >>> Iav is the average intensity, sd is the standard deviation, and sd^2 >>> is the variance. This is for a single spin. The sample number is so >>> low that this value is completely meaningless. Therefore the variance >>> is averaged across all spins (well due to a current bug the standard >>> deviation is averaged instead). Then another averaging takes place if >>> not all spectra are duplicated. The variances across all duplicated >>> spectra are averaged to give a single error value for all spins across >>> all spectra (again the sd averaging bug affects this). The reason for >>> using this approach is that you are not limited to duplicate spectra. >>> It also means that the factor of sqrt(2) is not applicable. If only >>> single spectra are collected, then relax's current behaviour of not >>> using sqrt(2) seems reasonable. >>> >>> >> Here is how I understand the sqrt(2) issue: >> >> The sd of duplicate (or triplicate, or quadruplicate, or ... ) peak >> heights is assumed to give a good estimate of the precision with which >> we can measure the height of a single peak. So for peak heights that >> have not been measured in duplicate (ie relaxation times that have not >> been duplicated in our current set of spectra), sd is a good estimate >> of the uncertainty associated with that height. >> >> For peaks we have measured more than once, we can calculate a mean >> peak height. The precision with which we know that mean value is given >> by the std error of the mean ie. sd/sqrt(n) where n is the number of >> times we have measured that specific relaxation time. I think this is >> the origin of the sqrt(2) for duplicate data. >> >> A made up example: >> T I >> 0 1.00 >> 10 0.90 >> 10 0.86 >> 20 0.80 >> 40 0.75 >> 70 0.72 >> 70 0.68 >> 100 0.55 >> 150 0.40 >> 200 0.30 >> >> The std deviation of our duplicates is 0.04 so the uncertainty on each >> value above is 0.04 >> >> BUT, the uncertainty on the mean values for our duplicate time points >> (10 and 70) is 0.04/sqrt(2) = 0.028 >> >> So if we use the mean values as points in our fit, we should use 0.028 >> as the uncertainty on those values (while all other peaks have >> uncertainty 0.04) >> >> Alternatively (and equivalently) we can use the original observations, >> including all duplicate points. In this case, all points have the same >> uncertainty of 0.04, as they are all the result of a single >> measurement. >> >> To do anything else is to underestimate the precision with which we >> have measured our relaxation rates. >> >> Chris >> >> >> >> >> >>> Regards, >>> >>> Edward >>> >>> >>> P.S. The idea for the 1.3 line is to create a new class of user >>> functions, 'spectrum.read_intensities()', 'spectrum.set_rmsd()', >>> 'spectrum.error_analysis()', etc. to make all of this independent of >>> the analysis type. See >>> https://mail.gna.org/public/relax-devel/2008-10/msg00029.html for >>> details. >>> >>> > > _______________________________________________ relax (http://nmr-relax.com) This is the relax-users mailing list relax-users@gna.org To unsubscribe from this list, get a password reminder, or change your subscription options, visit the list information page at https://mail.gna.org/listinfo/relax-users