Hi all,

Here is a reference some people in our lab have been using to extract
errors on NOE calculated from volumes.

Nicholson, Kay, Baldisseri, Arango, Young, Bax, and Torchia (1992)
Biochemistry, 31: 5253-5263.

And here is the interesting part...

======================
To estimate the error in NOE values, the standard deviation in baseline
noise (sigma) was determined for each spectrum and expanded into a standard
deviation in volume given by

                        Delta V = sigma sqrt(nk)

where n is the number of points in the f1 dimension and k is the number
of points in the f2 dimension of a typical peak at the baseline. The
error associated with the NOE value for peak j is then given by

error_j = (V_A_j / V_B_j) * sqrt[(Delta_V_A / V_A_j)^2 + (Delta_V_B / V_B_j)^2]

where V_A and V_B denote the volume of peak j in the presence and
absence of NOE enhancement, respectively, and Delta_V_A and Delta_V_B
denote the standard deviations in volume for the spectra recorded in the
presence and absence of NOE enhancement, respectively.
======================
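For what it's worth, the quoted formulas translate into a few lines of
Python. This is only a sketch with my own function and variable names,
not anything from the paper or from relax:

```python
from math import sqrt

def noe_and_error(v_sat, v_ref, sigma_sat, sigma_ref, n, k):
    """NOE ratio and its propagated error, from the quoted formulas.

    v_sat, v_ref -- peak volumes with and without NOE enhancement.
    sigma_sat, sigma_ref -- baseline noise std deviation of each spectrum.
    n, k -- points spanned by a typical peak in the f1 and f2 dimensions.
    """
    # Delta V = sigma * sqrt(n * k): baseline noise scaled up to a volume.
    dv_sat = sigma_sat * sqrt(n * k)
    dv_ref = sigma_ref * sqrt(n * k)
    noe = v_sat / v_ref
    # Standard propagation of relative errors for a ratio of two volumes.
    error = abs(noe) * sqrt((dv_sat / v_sat) ** 2 + (dv_ref / v_ref) ** 2)
    return noe, error
```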

I'm not an expert in statistics, so I won't judge the value of this
approach. However, it is an approach some people have been using...

Ciao !


Séb




Edward d'Auvergne wrote:
> Oh, I forgot about the std error formula.  Is that where the sqrt(2)
> comes from?  Doh, that would be embarrassing.  Then I know someone who would
> require sqrt(3) for the NOE spectra!  Is that really what Palmer
> meant, that std error is the same as "the standard deviation of the
> differences between the heights of corresponding peaks in the paired
> spectra" which "is equal to sqrt(2)*sigma" (Palmer et al., 1991)?
>
> I'm pretty sure though that the standard error is not the measure we
> want for the confidence interval of the peak intensity.  The reason is
> because I think that the std error is a measure of how far the sample
> mean is from the true mean (ignore this, this is a quick reference for
> myself: http://en.wikipedia.org/wiki/Standard_error_(statistics) ).
> (Warning, from here to the end of the paragraph is a rant!) This is
> similar in concept to AIC model selection (see
> http://en.wikipedia.org/wiki/Akaike_information_criterion and
> http://en.wikipedia.org/wiki/Model_selection if you haven't heard
> about the advanced statistical field of model selection before).  AIC
> is a little more advanced though as it estimates the Kullback-Leibler
> discrepancy (http://en.wikipedia.org/wiki/Kullback–Leibler_divergence)
> which is a measure of distance between the true distribution and the
> back-calculated distribution using all information about the
> distribution.  Ok, that wasn't too relevant.  Anyway, the std error as
> a measure of the differences in means of 2 different distributions is
> not a measure of the spread of either the true, measured, or
> back-calculated distributions (or the 4th distribution, the
> back-calculated from the fit to the true model).  The std error is not
> the confidence intervals of any of these 4 distributions, just the
> difference between 2 of them using only a small part of the
> information of those distributions.  It's the statistical measure of
> the difference in means of the true and measured distributions.  As an
> aside, for those completely lost now a clearer explanation of these 4
> distributions fundamental to data analysis, likelihood, discrepancies,
> etc. can be read in section 2.2 of my PhD thesis at
> http://dtl.unimelb.edu.au:80/R/-?func=dbin-jump-full&object_id=67077&current_base=GEN01
> or 
> http://www.amazon.com/Protein-Dynamics-Model-free-Analysis-Relaxation/dp/3639057627/ref=sr_1_6?ie=UTF8&s=books&qid=1219247007&sr=8-6
> (sorry for the blatant plug ;).  Oh, the best way to picture all of
> these concepts and the links between them is to draw and label 4
> distributions on a piece of paper on the same x and y-axes with not
> too much overlap between them and connect them with arrows labelled
> with all the weird terminology.
>
> Sorry, that was just a long way of saying that the std error measures
> how well the sample mean matches the true mean, while the standard
> deviation measures the spread of the distribution.  That being so,
> I would avoid setting the standard error as the peak height
> uncertainty.  Maybe it would be best to do as you say Chris, and also
> avoid the averaging of the replicated intensities.
>
> So, now to the dirty statistics required for the sparse sampling
> caused by lack of NMR time.  We can use the standard deviation formula
> - the measure of the spread of the measured distribution which is the
> estimator of the spread of the true distribution that optimisation
> (which is chi-squared fitting and hence finding the maximal likelihood
> (http://en.wikipedia.org/wiki/Maximum_likelihood)), Monte Carlo
> simulations, and the frequentist model selection techniques all
> utilise as the foundation for their fundamental derivations.  We would
> then square this to get the variance:
>
> sigma^2 = sum({Ii - Iav}^2) / (n - 1).
>
> In my previous posts I was saying sigma was the variance, but just
> ignore that.  Next we need to average the variance across all spins,
> simply because the sample size is so low for each peak and hence the
> error estimate is horrible.  Whether this estimator of the true
> variance is good or bad is debatable (well, actually, it's bad), but
> it is unavoidable.  It also has the obvious disadvantage in that the
> peak height error is, in reality, different for each peak.
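As a Python sketch of what Edward describes here, with my own naming and
not relax's actual implementation:

```python
def averaged_variance(replicates):
    """Average the per-spin sample variance across all spins.

    replicates -- per-spin lists of replicate intensities, e.g.
    [[I_1, I_2], [I_1, I_2], ...], one inner list per peak.
    """
    variances = []
    for intensities in replicates:
        n = len(intensities)
        i_av = sum(intensities) / n
        # sigma^2 = sum((Ii - Iav)^2) / (n - 1), the sample variance.
        variances.append(sum((i - i_av) ** 2 for i in intensities) / (n - 1))
    # Average across spins: each per-peak estimate is hopeless on its
    # own because the sample size n is so small.
    return sum(variances) / len(variances)
```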
>
> Now, if not all spectra are replicated, then the approach needs to be
> modified to give us errors for the peaks of the single spectra.  Each
> averaged replicated time point (spectrum) has a single error
> associated with it, the average variance.  These are usually different
> for different time points, in some cases weakly decreasing
> exponentially.  So I think we should average the average variances and
> have a single measure of spread for all peaks in all spectra.  This
> estimator of the variance is again bad.  Interpolation might be
> better, but is still not great.
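And the final collapse to one error for everything, again as a sketch
with hypothetical names rather than relax code:

```python
def single_error(avg_variances):
    """One error for all peaks in all spectra.

    avg_variances -- the spin-averaged variance of each replicated
    time point (spectrum), as computed above.
    """
    # Average the averaged variances across time points, then take the
    # square root to get back to a standard deviation.
    return (sum(avg_variances) / len(avg_variances)) ** 0.5
```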
>
> Cheers,
>
> Edward
>
>
> P.S.  None of this affects an analysis using peak heights of
> non-replicated spectra and the RMSD of the baseplane noise.  But if
> someone wants to use the peak volume as a measure of peak intensity,
> then we have a statistical problem.  Until someone finds a reference
> or derives the formula for how the RMSD of the base plane noise
> relates to volume error, then peak heights will be essential for error
> analysis.  I'm also willing to be corrected on any of the statistics
> above as I'm not an expert in this and may have missed some
> fundamental connections between theories.  And there may be a less
> 'dirty' way of doing the dirty part of the statistics.
>
>
>
> On Fri, Oct 17, 2008 at 1:59 AM, Chris MacRaild <[EMAIL PROTECTED]> wrote:
>   
>> On Thu, Oct 16, 2008 at 8:07 PM, Edward d'Auvergne
>> <[EMAIL PROTECTED]> wrote:
>>     
>>> On Thu, Oct 16, 2008 at 7:02 AM, Chris MacRaild <[EMAIL PROTECTED]> wrote:
>>>       
>>>> On Thu, Oct 16, 2008 at 3:11 PM, Sébastien Morin
>>>> <[EMAIL PROTECTED]> wrote:
>>>>         
>>>>> Hi,
>>>>>
>>>>> I have a general question about curve fitting within relax.
>>>>>
>>>>> Let's say I proceed to curve fitting for some relaxation rates
>>>>> (exponential decay) and that I have a duplicate delay for error 
>>>>> estimation.
>>>>>
>>>>> ========
>>>>> delays
>>>>>
>>>>> 0.01
>>>>> 0.01
>>>>> 0.02
>>>>> 0.04
>>>>> ...
>>>>> ========
>>>>>
>>>>> Will the mean value (for delay 0.01) be used for curve fitting and rate
>>>>> extraction ?
>>>>> Or will both values at delay 0.01 be used during curve fitting, hence
>>>>> giving more weight on delay 0.01 ?
>>>>>
>>>>> In other words, will the fit only use both values at delay 0.01 for
>>>>> error estimation or also for rate extraction, giving more weight for
>>>>> this duplicate point ?
>>>>>
>>>>> How is this handled in relax ?
>>>>>
>>>>> Instinctively, I would guess that the mean value must be used for
>>>>> fitting, as we don't want the points that are not duplicated to count
>>>>> less in the fitting procedure... Am I right ?
>>>>>
>>>>>           
>>>> I would argue not. If we have gone to the trouble of measuring
>>>> something twice (or, equivalently, measuring it with greater
>>>> precision) then we should weight it more strongly to reflect that.
>>>>
>>>> So we should include both duplicate points in our fit, or we should
>>>> just use the mean value, but weight it to reflect the greater
>>>> certainty we have in its value.
>>>>
>>>> As I type this I realise this is likely the source of the sqrt(2)
>>>> factor Tyler and Edward have been debating on a parallel thread - the
>>>> uncertainty in height of any one peak is equal to the RMS noise, but
>>>> the std error of the mean of duplicates is less by a factor of
>>>> sqrt(2).
>>>>         
>>> At the moment, relax simply uses the mean value in the fit.  Despite
>>> the higher quality of the duplicated data, all points are given the
>>> same weight.  This is only because of the low data quantity.  As for
>>> dividing the sd of differences between duplicate spectra by sqrt(2),
>>> this is not done in relax anymore.  Because some people have collected
>>> triplicate spectra, although rare, relax calculates the error from
>>> replicated spectra differently.  I'm prepared to be told that this
>>> technique is incorrect though.  The procedure relax uses is to apply
>>> the formula:
>>>
>>> sd^2 = sum({Ii - Iav}^2) / (n - 1),
>>>
>>> where n is the number of spectra, Ii is the intensity in spectrum i,
>>> Iav is the average intensity, sd is the standard deviation, and sd^2
>>> is the variance.  This is for a single spin.  The sample number is so
>>> low that this value is completely meaningless.  Therefore the variance
>>> is averaged across all spins (well due to a current bug the standard
>>> deviation is averaged instead).  Then another averaging takes place if
>>> not all spectra are duplicated.  The variances across all duplicated
>>> spectra are averaged to give a single error value for all spins across
>>> all spectra (again the sd averaging bug affects this).  The reason for
>>> using this approach is that you are not limited to duplicate spectra.
>>> It also means that the factor of sqrt(2) is not applicable.  If only
>>> single spectra are collected, then relax's current behaviour of not
>>> using sqrt(2) seems reasonable.
>>>
>>>       
>> Here is how I understand the sqrt(2) issue:
>>
>> The sd of duplicate (or triplicate, or quadruplicate, or ... ) peak
>> heights is assumed to give a good estimate of the precision with which
>> we can measure the height of a single peak. So for peak heights that
>> have not been measured in duplicate (ie relaxation times that have not
>> been duplicated in our current set of spectra), sd is a good estimate
>> of the uncertainty associated with that height.
>>
>> For peaks we have measured more than once, we can calculate a mean
>> peak height. The precision with which we know that mean value is given
>> by the std error of the mean ie. sd/sqrt(n) where n is the number of
>> times we have measured that specific relaxation time. I think this is
>> the origin of the sqrt(2) for duplicate data.
>>
>> A made up example:
>> T           I
>> 0          1.00
>> 10        0.90
>> 10        0.86
>> 20        0.80
>> 40        0.75
>> 70        0.72
>> 70        0.68
>> 100      0.55
>> 150      0.40
>> 200      0.30
>>
>> The std deviation of our duplicates is 0.04 so the uncertainty on each
>> value above is 0.04
>>
>> BUT, the uncertainty on the mean values for our duplicate time points
>> (10 and 70) is 0.04/sqrt(2) = 0.028
>>
>> So if we use the mean values as points in our fit, we should use 0.028
>> as the uncertainty on those values (while all other peaks have
>> uncertainty 0.04)
>>
>> Alternatively (and equivalently) we can use the original observations,
>> including all duplicate points. In this case, all points have the same
>> uncertainty of 0.04, as they are all the result of a single
>> measurement.
>>
>> To do anything else is to underestimate the precision with which we
>> have measured our relaxation rates.
>>
>> Chris
>>
>>
>>
>>
>>     
>>> Regards,
>>>
>>> Edward
>>>
>>>
>>> P.S.  The idea for the 1.3 line is to create a new class of user
>>> functions, 'spectrum.read_intensities()', 'spectrum.set_rmsd()',
>>> 'spectrum.error_analysis()', etc. to make all of this independent of
>>> the analysis type.  See
>>> https://mail.gna.org/public/relax-devel/2008-10/msg00029.html for
>>> details.
>>>
>>>       
>
>   



_______________________________________________
relax (http://nmr-relax.com)

This is the relax-users mailing list
relax-users@gna.org

To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-users
