Hi,

Before you sent this message, I was talking to Ben Frank (a PhD
student in Griesinger's lab) about this exact problem - converting the
baseplane RMSD noise into a volume error.  The formula of Nicholson et
al., 1992 that you mentioned makes perfect sense, as that is what we
came up with too.  Volume integration over a given area is the sum of
the heights of all the discrete points of the frequency domain
spectrum within that box, so the error of a single point is the same
as that of the peak height.  We just have n*m points within this box.
And as variances add, not standard deviations, the variance (sigma^2)
of the volume is:

sigma_vol^2 = sigma_i^2 * n * m,

where sigma_vol is the standard deviation of the volume, sigma_i is
the standard deviation of a single point assumed to be equal to the
RMSD of the baseplane noise, and n and m are the dimensions of the
box.  Taking the square root of this gives the Nicholson et al.
formula:

sigma_vol = sigma_i * sqrt(n*m).
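
As a quick sanity check, here is what this would look like in Python
(a minimal sketch, with the function name being purely illustrative):

    from math import sqrt

    def volume_error(rmsd, n, m):
        # The box integral is the sum of n*m point heights, each with
        # a standard deviation equal to the baseplane RMSD.  Variances
        # add, so the volume error is the RMSD scaled by sqrt(n*m).
        return rmsd * sqrt(n * m)

    # E.g. volume_error(1500.0, 10, 8) gives ~13416.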

However, I doubt that many people in the field use this volume
integration method.  I know that this is only one of the three methods
in Sparky (http://www.cgl.ucsf.edu/home/sparky/manual/peaks.html#Integration).
The program peakint
(http://hugin.ethz.ch/wuthrich/software/xeasy/xeasy_m15.html) doesn't
use all points in the box.  And Cara does things differently again
(http://www.cara.ethz.ch/Wiki/Integration).  So even if the user has
the baseplane RMSD measured, I have no idea if they can work out how
many points in the spectrum were used in the integration.  And if a
Gaussian or Lorentzian fit is used, then this formula definitely
cannot be applied.  Maybe support for the different methods of the
different programs can be added to relax one by one, or maybe the user
could be asked which type of fit was used and how many points were
integrated, but this is going to be difficult to do.

Regards,

Edward


On Fri, Oct 17, 2008 at 6:15 PM, Sébastien Morin
<[EMAIL PROTECTED]> wrote:
> Hi all,
>
> Here is a reference some people in our lab have been using to extract
> errors on NOE calculated from volumes.
>
> Nicholson, Kay, Baldisseri, Arango, Young, Bax, and Torchia (1992)
> Biochemistry, 31: 5253-5263.
>
> And here is the interesting part...
>
> ======================
> To estimate the error in NOE values, the standard deviation in baseline
> noise (sigma) was determined for each spectrum and expanded into a standard
> deviation in volume given by
>
>                        Delta V = sigma sqrt(nk)
>
> where n is the number of points in the f1 dimension and k is the number
> of points in the f2 dimension of a typical peak at the baseline. The
> error associated with the NOE value for peak j is then given by
>
> error_j = (V_A_j / V_B_j) * sqrt[(Delta_V_A / V_A_j)^2 + (Delta_V_B /
> V_B_j)^2]
>
> where V_A and V_B denote the volume of peak j in the presence and
> absence of NOE enhancement, respectively, and Delta_V_A and Delta_V_B
> denote the standard deviations in volume for the spectra recorded in the
> presence and absence of NOE enhancement, respectively.
> ======================
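>
> In Python, this error propagation would read something like (a sketch
> with made-up names):
>
>     from math import sqrt
>
>     def noe_error(V_A, V_B, dV_A, dV_B):
>         # Propagate the relative volume errors through the ratio
>         # NOE = V_A / V_B, as in the excerpt above.
>         return (V_A / V_B) * sqrt((dV_A / V_A)**2 + (dV_B / V_B)**2)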
>
> I'm not an expert in statistics, so won't judge the value of this
> approach. However, this is an approach some people have been using...
>
> Ciao !
>
>
> Séb
>
>
>
>
> Edward d'Auvergne wrote:
>> Oh, I forgot about the std error formula.  Is that where the sqrt(2)
>> comes from?  Doh, that would be silly.  Then I know someone who would
>> require sqrt(3) for the NOE spectra!  Is that really what Palmer
>> meant, that the std error is the same as "the standard deviation of
>> the differences between the heights of corresponding peaks in the
>> paired spectra" which "is equal to sqrt(2)*sigma" (Palmer et al., 1991)?
>>
>> I'm pretty sure though that the standard error is not the measure we
>> want for the confidence interval of the peak intensity.  The reason
>> is that the std error is a measure of how far the sample mean is from
>> the true mean (ignore this, this is a quick reference for myself:
>> http://en.wikipedia.org/wiki/Standard_error_(statistics) ).
>> (Warning, from here to the end of the paragraph is a rant!) This is
>> similar in concept to AIC model selection (see
>> http://en.wikipedia.org/wiki/Akaike_information_criterion and
>> http://en.wikipedia.org/wiki/Model_selection if you haven't heard
>> about the advanced statistical field of model selection before).  AIC
>> is a little more advanced though as it estimates the Kullback-Leibler
>> discrepancy (http://en.wikipedia.org/wiki/Kullback–Leibler_divergence)
>> which is a measure of distance between the true distribution and the
>> back-calculated distribution using all information about the
>> distribution.  Ok, that wasn't too relevant.  Anyway, the std error as
>> a measure of the differences in means of 2 different distributions is
>> not a measure of the spread of either the true, measured, or
>> back-calculated distributions (or the 4th distribution, the one
>> back-calculated from the fit to the true model).  The std error is not
>> the confidence interval of any of these 4 distributions, just the
>> difference between 2 of them using only a small part of the
>> information of those distributions.  It's the statistical measure of
>> the difference in means of the true and measured distributions.  As an
>> aside, for those completely lost by now, a clearer explanation of
>> these 4 distributions fundamental to data analysis, likelihood,
>> discrepancies, etc. can be read in section 2.2 of my PhD thesis at
>> http://dtl.unimelb.edu.au:80/R/-?func=dbin-jump-full&object_id=67077&current_base=GEN01
>> or
>> http://www.amazon.com/Protein-Dynamics-Model-free-Analysis-Relaxation/dp/3639057627/ref=sr_1_6?ie=UTF8&s=books&qid=1219247007&sr=8-6
>> (sorry for the blatant plug ;).  Oh, the best way to picture all of
>> these concepts and the links between them is to draw and label 4
>> distributions on a piece of paper on the same x and y-axes with not
>> too much overlap between them and connect them with arrows labelled
>> with all the weird terminology.
>>
>> Sorry, that was just a long way of saying that the std error measures
>> how well the sample mean matches the true mean, whereas the standard
>> deviation measures the spread of the distribution.  That being so,
>> I would avoid setting the standard error as the peak height
>> uncertainty.  Maybe it would be best to do as you say Chris, and also
>> avoid the averaging of the replicated intensities.
>>
>> So, now to the dirty statistics required for the sparse sampling
>> caused by lack of NMR time.  We can use the standard deviation
>> formula - the measure of the spread of the measured distribution,
>> which is the estimator of the spread of the true distribution that
>> optimisation (i.e. chi-squared fitting, and hence finding the maximum
>> likelihood, http://en.wikipedia.org/wiki/Maximum_likelihood), Monte
>> Carlo simulations, and the frequentist model selection techniques all
>> utilise as the foundation for their fundamental derivations.  We
>> would then square this to get the variance:
>>
>> sigma^2 = sum({Ii - Iav}^2) / (n - 1).
>>
>> In my previous posts I was saying sigma was the variance, but just
>> ignore that.  Next we need to average the variance across all spins,
>> simply because the sample size is so low for each peak and hence the
>> error estimate is horrible.  Whether this estimator of the true
>> variance is good or bad is debatable (well, actually, it's bad), but
>> it is unavoidable.  It also has the obvious disadvantage in that the
>> peak height error is, in reality, different for each peak.
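>>
>> In numpy, this dirty part would look something like the following
>> (just a sketch, with made-up array names):
>>
>>     import numpy
>>
>>     # heights[spin, replicate]: peak heights from the replicated
>>     # spectra.  Per-spin variance with the n - 1 denominator, then
>>     # averaged across all spins to give the single error estimate.
>>     variances = numpy.var(heights, axis=1, ddof=1)
>>     ave_variance = numpy.mean(variances)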
>>
>> Now, if not all spectra are replicated, then the approach needs to be
>> modified to give us errors for the peaks of the single spectra.  Each
>> averaged replicated time point (spectrum) has a single error
>> associated with it, the average variance.  These are usually different
>> for different time points, in some cases weakly decreasing
>> exponentially.  So I think we should average the average variances and
>> have a single measure of spread for all peaks in all spectra.  This
>> estimator of the variance is again bad.  Interpolation might be
>> better, but is still not great.
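>>
>> Continuing the numpy sketch above, this last averaging would simply
>> be:
>>
>>     # ave_variances: one spin-averaged variance per replicated time
>>     # point.  Averaging these gives the single measure of spread for
>>     # all peaks in all spectra.
>>     global_variance = numpy.mean(ave_variances)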
>>
>> Cheers,
>>
>> Edward
>>
>>
>> P.S.  None of this affects an analysis using peak heights of
>> non-replicated spectra and the RMSD of the baseplane noise.  But if
>> someone wants to use the peak volume as a measure of peak intensity,
>> then we have a statistical problem.  Until someone finds a reference
>> or derives the formula for how the RMSD of the baseplane noise
>> relates to the volume error, peak heights will be essential for error
>> analysis.  I'm also willing to be corrected on any of the statistics
>> above as I'm not an expert in this and may have missed some
>> fundamental connections between theories.  And there may be a less
>> 'dirty' way of doing the dirty part of the statistics.
>>
>>
>>
>> On Fri, Oct 17, 2008 at 1:59 AM, Chris MacRaild <[EMAIL PROTECTED]> wrote:
>>
>>> On Thu, Oct 16, 2008 at 8:07 PM, Edward d'Auvergne
>>> <[EMAIL PROTECTED]> wrote:
>>>
>>>> On Thu, Oct 16, 2008 at 7:02 AM, Chris MacRaild <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> On Thu, Oct 16, 2008 at 3:11 PM, Sébastien Morin
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a general question about curve fitting within relax.
>>>>>>
>>>>>> Let's say I proceed to curve fitting for some relaxation rates
>>>>>> (exponential decay) and that I have a duplicate delay for error 
>>>>>> estimation.
>>>>>>
>>>>>> ========
>>>>>> delays
>>>>>>
>>>>>> 0.01
>>>>>> 0.01
>>>>>> 0.02
>>>>>> 0.04
>>>>>> ...
>>>>>> ========
>>>>>>
>>>>>> Will the mean value (for delay 0.01) be used for curve fitting and rate
>>>>>> extraction ?
>>>>>> Or will both values at delay 0.01 be used during curve fitting, hence
>>>>>> giving more weight on delay 0.01 ?
>>>>>>
>>>>>> In other words, will the fit only use both values at delay 0.01 for
>>>>>> error estimation or also for rate extraction, giving more weight for
>>>>>> this duplicate point ?
>>>>>>
>>>>>> How is this handled in relax ?
>>>>>>
>>>>>> Instinctively, I would guess that the mean value must be used for
>>>>>> fitting, as we don't want the points that are not in duplicate to count
>>>>>> less in the fitting procedure... Am I right ?
>>>>>>
>>>>>>
>>>>> I would argue not. If we have gone to the trouble of measuring
>>>>> something twice (or, equivalently, measuring it with greater
>>>>> precision) then we should weight it more strongly to reflect that.
>>>>>
>>>>> So we should include both duplicate points in our fit, or we should
>>>>> just use the mean value, but weight it to reflect the greater
>>>>> certainty we have in its value.
>>>>>
>>>>> As I type this I realise this is likely the source of the sqrt(2)
>>>>> factor Tyler and Edward have been debating on a parallel thread - the
>>>>> uncertainty in height of any one peak is equal to the RMS noise, but
>>>>> the std error of the mean of duplicates is less by a factor of
>>>>> sqrt(2).
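>>>>>
>>>>> As a rough Python sketch of the weighted alternative (made-up
>>>>> names, and assuming a simple two-parameter exponential):
>>>>>
>>>>>     import numpy
>>>>>     from scipy.optimize import leastsq
>>>>>
>>>>>     def residuals(params, t, I, errors):
>>>>>         I0, R = params
>>>>>         # Dividing each residual by its uncertainty makes the more
>>>>>         # precise (duplicated) points pull harder on the fit.
>>>>>         return (I - I0 * numpy.exp(-R * t)) / errors
>>>>>
>>>>>     params, ier = leastsq(residuals, [1.0, 0.01],
>>>>>                           args=(t, I, errors))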
>>>>>
>>>> At the moment, relax simply uses the mean value in the fit.  Despite
>>>> the higher quality of the duplicated data, all points are given the
>>>> same weight.  This is only because of the low data quantity.  As for
>>>> dividing the sd of differences between duplicate spectra by sqrt(2),
>>>> this is not done in relax anymore.  Because some people have collected
>>>> triplicate spectra, although rare, relax calculates the error from
>>>> replicated spectra differently.  I'm prepared to be told that this
>>>> technique is incorrect though.  The procedure relax uses is to apply
>>>> the formula:
>>>>
>>>> sd^2 = sum({Ii - Iav}^2) / (n - 1),
>>>>
>>>> where n is the number of spectra, Ii is the intensity in spectrum i,
>>>> Iav is the average intensity, sd is the standard deviation, and sd^2
>>>> is the variance.  This is for a single spin.  The sample number is so
>>>> low that this value is completely meaningless.  Therefore the variance
>>>> is averaged across all spins (well due to a current bug the standard
>>>> deviation is averaged instead).  Then another averaging takes place if
>>>> not all spectra are duplicated.  The variances across all duplicated
>>>> spectra are averaged to give a single error value for all spins across
>>>> all spectra (again the sd averaging bug affects this).  The reason for
>>>> using this approach is that you are not limited to duplicate spectra.
>>>> It also means that the factor of sqrt(2) is not applicable.  If only
>>>> single spectra are collected, then relax's current behaviour of not
>>>> using sqrt(2) seems reasonable.
>>>>
>>>>
>>> Here is how I understand the sqrt(2) issue:
>>>
>>> The sd of duplicate (or triplicate, or quadruplicate, or ... ) peak
>>> heights is assumed to give a good estimate of the precision with which
>>> we can measure the height of a single peak. So for peak heights that
>>> have not been measured in duplicate (ie relaxation times that have not
>>> been duplicated in our current set of spectra), sd is a good estimate
>>> of the uncertainty associated with that height.
>>>
>>> For peaks we have measured more than once, we can calculate a mean
>>> peak height. The precision with which we know that mean value is given
>>> by the std error of the mean ie. sd/sqrt(n) where n is the number of
>>> times we have measured that specific relaxation time. I think this is
>>> the origin of the sqrt(2) for duplicate data.
>>>
>>> A made up example:
>>> T           I
>>> 0          1.00
>>> 10        0.90
>>> 10        0.86
>>> 20        0.80
>>> 40        0.75
>>> 70        0.72
>>> 70        0.68
>>> 100      0.55
>>> 150      0.40
>>> 200      0.30
>>>
>>> The std deviation of our duplicates is 0.04, so the uncertainty on each
>>> value above is 0.04.
>>>
>>> BUT, the uncertainty on the mean values for our duplicate time points
>>> (10 and 70) is 0.04/sqrt(2) = 0.028
>>>
>>> So if we use the mean values as points in our fit, we should use 0.028
>>> as the uncertainty on those values (while all other peaks have
>>> uncertainty 0.04)
>>>
>>> Alternatively (and equivalently) we can use the original observations,
>>> including all duplicate points. In this case, all points have the same
>>> uncertainty of 0.04, as they are all the result of a single
>>> measurement.
>>>
>>> To do anything else is to underestimate the precision with which we
>>> have measured our relaxation rates.
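>>>
>>> In Python terms, the bookkeeping for the example above is trivial:
>>>
>>>     from math import sqrt
>>>
>>>     sd = 0.04           # uncertainty of a single measurement
>>>     sem = sd / sqrt(2)  # ~0.028, uncertainty on a duplicate's mean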
>>>
>>> Chris
>>>
>>>
>>>
>>>
>>>
>>>> Regards,
>>>>
>>>> Edward
>>>>
>>>>
>>>> P.S.  The idea for the 1.3 line is to create a new class of user
>>>> functions, 'spectrum.read_intensities()', 'spectrum.set_rmsd()',
>>>> 'spectrum.error_analysis()', etc. to make all of this independent of
>>>> the analysis type.  See
>>>> https://mail.gna.org/public/relax-devel/2008-10/msg00029.html for
>>>> details.
>>>>
>>>>
>>
>>
>
>
