Hi Troels,

Looking at page 28, note that they are using the SSE value of
sum((Y_data - Y_curve)^2) and not the chi2 value of
sum(((Y_data - Y_curve)/sigma)^2).  This is a major difference!  The
chi-squared value is the SSE normalised to unit variance.  Any
statistics or techniques relying on the SSE value cannot be used when
the chi-squared value has been minimised - i.e. they cannot be mapped
to the relax results.  Also note that this page/book is talking about
non-linear regression
(https://en.wikipedia.org/wiki/Nonlinear_regression) whereas in relax
we use non-linear least squares fitting
(https://en.wikipedia.org/wiki/Non-linear_least_squares).  Although
related, these are different fields, so care must be taken when using
techniques from one in the other.  In most cases that would be
theoretically disallowed.  This statement from the regression
Wikipedia article
https://en.wikipedia.org/wiki/Nonlinear_regression#Ordinary_and_weighted_least_squares
is useful:

    "The best-fit curve is often assumed to be that which minimizes
the sum of squared residuals. This is the (ordinary) least squares
(OLS) approach. However, in cases where the dependent variable does
not have constant variance, a sum of weighted squared residuals may be
minimized; see weighted least squares. Each weight should ideally be
equal to the reciprocal of the variance of the observation, but
weights may be recomputed on each iteration, in an iteratively
weighted least squares algorithm."

I.e. the two techniques are assumed to give the same result.  This is
often the case if the errors of all points are the same (though not
always, due to the change in curvature of the space if the errors are
not 1.0), the noise is Gaussian, and there are no outliers.
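
To make the difference concrete, here is a minimal Python/numpy sketch
of the two target functions (illustrative values only, not taken from
relax):

    import numpy as np

    # Hypothetical measured values, back-calculated curve and errors.
    y_data = np.array([10.2, 12.1, 15.3, 19.8])
    y_curve = np.array([10.0, 12.5, 15.0, 20.0])
    sigma = np.array([0.3, 0.4, 0.3, 0.5])

    # SSE: the ordinary (unweighted) least squares target.
    sse = np.sum((y_data - y_curve)**2)

    # chi2: the residuals normalised by the per-point errors before squaring.
    chi2 = np.sum(((y_data - y_curve) / sigma)**2)

    # The two only coincide (up to a constant factor) when all sigma values
    # are equal, so a chi2 value from relax cannot be reinterpreted as SSE.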

Which part of page 28 is disturbing?  We do not make the second
assumption "that the average amount of scatter is the same all the way
along the curve".  The chi-squared value takes care of this - each
point has its own independent error.  The first assumption is also
fine, unless the NMR experiment has failed in some strange way
affecting the spectral baseline or the spectral processing has
introduced baseline artefacts.  As for the weighting, this is not
needed in the relaxation dispersion analysis.  It is sometimes used in
relax when combining RDC and PCS data in the N-state model, but that
is for when errors are not known, and it is there to make sure the two
data types carry roughly the same weight (which a full error analysis
would have done).  I'll continue with the other sections later.

I would also recommend that you read the introductory chapters 1 and
2 of "Numerical Optimization" by Jorge Nocedal and Stephen J. Wright.
This will give a good summary of the least squares problem in
comparison to this GraphPad Prism 4 software book, which covers
regression.  The book will be in your maths library, or full PDFs can
be found online.  It is a brilliant reference for all optimisation
concepts.

Cheers,

Edward

On 19 January 2015 at 10:53, Troels Emtekær Linnet
<tlin...@nmr-relax.com> wrote:
> Hi Edward.
>
> I have used this regression book.
> http://www.graphpad.com/faq/file/Prism4RegressionBook.pdf
>
> I like the language (fun and humorous), and it goes into great detail.
> For example, there is quite a list of weighting methods.
> I find the comments on page 28 a little disturbing.
> (See also page 86+87)
>
> The Monte-Carlo simulation is described on page 104.
>
> 1) Create an ideal data set.
> -> Check. This is done with relax
> monte_carlo.create_data(method='back_calc')
>
> 2) Add random scatter.
> relax now adds random scatter per individual data point.  The random
> scatter is drawn from the measured error of the data point.
> But the book suggests adding errors described by the variance of the
> residuals.  This is described on page 33.
> "If you chose to weight the values and minimize the relative distance
> squared (or some other weighting function), goodness-of-fit is quantified
> with the weighted sum-of-squares."
>
> Sy.x = sqrt(SS/dof) = sqrt(chi2 / dof)
> The question is of course whether SS should be the sum of squares for
> the weighted points, or for the non-weighted R2eff points.
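>
> A rough numpy sketch of the two options (hypothetical residuals, not
> relax code) could be:
>
> import numpy as np
>
> # Hypothetical R2eff residuals (Y_data - Y_curve) and their errors.
> residuals = np.array([0.2, -0.4, 0.3, -0.2, 0.1])
> sigma = np.array([0.3, 0.4, 0.3, 0.5, 0.4])
> dof = len(residuals) - 3      # N - p, with p model parameters (here 3)
>
> # Option 1: Sy.x from the non-weighted R2eff residuals (same units as R2eff).
> sy_x_unweighted = np.sqrt(np.sum(residuals**2) / dof)
>
> # Option 2: Sy.x from the weighted residuals, i.e. sqrt(chi2/dof), which is
> # dimensionless and so not directly comparable to the R2eff errors.
> sy_x_weighted = np.sqrt(np.sum((residuals / sigma)**2) / dof)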
>
> Best
> Troels
>
> 2015-01-19 10:31 GMT+01:00 Edward d'Auvergne <edw...@nmr-relax.com>:
>>
>> Hi Troels,
>>
>> Do you have a reference for the technique?  You mentioned a 'fitting
>> guide' in one of your commit messages, but without a reference to it.
>> I would like to study the technique to understand the implementation.
>> Does it have another name?  I would guess so as it breaks the
>> statistical principles that the Monte Carlo simulation technique uses.
>> That is why the monte_carlo.create_data user function says on the
>> first line that the technique is called bootstrapping rather than
>> Monte Carlo simulations if the 'method' argument is changed.   I would
>> also like to check if it is implemented properly.  For example
>> accessing the chi2, converting it to the reduced chi2 and using this
>> as the SSE may not be correct, as the former is normalised by the
>> errors and the latter is not.  You may need to recalculate the SSE as
>> there is no way to convert from the full/reduced chi2 value to an SSE
>> value.  I would also like to see if this implementation is a new
>> methodology that sits beside Monte Carlo simulations and Bootstrapping
>> simulations, or if it can be used for both of these.
>>
>> Cheers,
>>
>> Edward
>>
>>
>>
>> On 16 January 2015 at 19:13, Edward d'Auvergne <edw...@nmr-relax.com>
>> wrote:
>> > Hi,
>> >
>> > Sorry, I meant sum_i(residual_i / error_i)/N > 1.0.  As for
>> > calculating the bias value, it looks like Wikipedia has a reasonable
>> > description (https://en.wikipedia.org/wiki/Bias_of_an_estimator).
>> >
>> > Regards,
>> >
>> > Edward
>> >
>> >
>> > On 16 January 2015 at 19:07, Edward d'Auvergne <edw...@nmr-relax.com>
>> > wrote:
>> >> Hi,
>> >>
>> >> If you plot the R2eff errors from the Monte Carlo simulations of that
>> >> model, are they Gaussian?  Well, that's assuming you have full
>> >> dispersion curves.  Theoretically from the white noise in the NMR
>> >> spectrum they should be.  Anyway, even if it is not claimed that Monte
>> >> Carlo simulations have failed, if you have small errors from MC
>> >> simulations and large errors from another error analysis technique, and
>> >> then pick the large errors, that implies that the Monte Carlo
>> >> simulations have failed.  As for using residuals, this is a fall-back
>> >> technique when experimental errors have not, or cannot, be measured.
>> >> This uses another convergence assumption - if you have infinite data
>> >> and the model has a bias value of exactly 0.0, then the residuals
>> >> converge to the experimental errors.  For clustering, you might
>> >> violate this bias == 0.0 condition (that is almost guaranteed).  You
>> >> should also plot your residuals.  If the average residual value is not
>> >> 0.0, then you have model bias.  The same holds if you see a trend in
>> >> the residual plot.
>> >>
>> >> For using the residuals, how do these values compare to the error
>> >> values?  If the residuals are bigger than the experimental error, then
>> >> this also indicates that abs(bias) > 0.0.  A scatter plot of R2eff
>> >> residuals vs. errors might be quite useful.  This should be a linear
>> >> plot with a gradient of 1, anything else indicates bias.  There might
>> >> even be a way of calculating the bias value from the residuals and
>> >> errors, though I've forgotten how this is done.  Anyway, if you have a
>> >> large bias due to the residuals being bigger than the R2eff error,
>> >> using the residuals for error analysis is not correct.  It will
>> >> introduce larger errors, that is guaranteed.  So you will have the
>> >> result that kex has larger errors.  But it is useful to understand the
>> >> theoretical reason why.  A large component of that kex error is the
>> >> modelling bias.  So if sum_i(residual_i / error_i) > 1.0, then you
>> >> likely have under-fitting.  This could be caused by clustering or the
>> >> 2-site model being insufficient.  In any case, using the residuals for
>> >> an error analysis to work around kex errors being too small only
>> >> indicates a problem with the data modelling.
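>> >>
>> >> A quick numpy sketch of these residual checks (hypothetical values, not
>> >> relax code):
>> >>
>> >> import numpy as np
>> >>
>> >> # Hypothetical R2eff residuals and their experimental errors.
>> >> residuals = np.array([0.2, -0.4, 0.3, -0.2, 0.1])
>> >> errors = np.array([0.3, 0.4, 0.3, 0.5, 0.4])
>> >>
>> >> # Bias check: the average residual should be close to 0.0.
>> >> bias = np.mean(residuals)
>> >>
>> >> # Under-fitting check: using absolute values, this ratio should be close
>> >> # to 1.0; values well above 1.0 suggest abs(bias) > 0.0.
>> >> ratio = np.mean(np.abs(residuals) / errors)
>> >>
>> >> # A scatter plot of abs(residuals) against errors should be roughly
>> >> # linear with a gradient of 1 if the model bias is negligible.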
>> >>
>> >> What about testing the covariance matrix technique?  I guess that due
>> >> to small amounts of data that single-item-out cross validation is not
>> >> an option.
>> >>
>> >> Regards,
>> >>
>> >> Edward
>> >>
>> >>
>> >>
>> >> On 16 January 2015 at 18:25, Troels Emtekær Linnet
>> >> <tlin...@nmr-relax.com> wrote:
>> >>> Hi Edward.
>> >>>
>> >>> I do not claim that "Monte Carlo simulations" is not the gold
>> >>> standard.
>> >>>
>> >>> I am merely trying to investigate the method by which one draws the
>> >>> errors.
>> >>>
>> >>> In the current case for dispersion, one trusts the R2eff errors to be
>> >>> the distribution.  These are individual per spin.
>> >>>
>> >>> Another distribution could be from how well the clustered fit
>> >>> performed.
>> >>> And this is what I am looking into.
>> >>>
>> >>> Best
>> >>> Troels
>> >>>
>> >>> 2015-01-16 18:09 GMT+01:00 Edward d'Auvergne <edw...@nmr-relax.com>:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> Do the R2eff errors look reasonable?  Another issue is that in a
>> >>>> clustered analysis, certain parameters can be over-constrained by
>> >>>> being shared between multiple data sets.  This is the bias introduced
>> >>>> by an
>> >>>> under-fitted problem.  This can artificially decrease the errors.
>> >>>> Anyway, you should plot the Monte Carlo simulations, a bit like I did
>> >>>> in figure 4 of my paper:
>> >>>>
>> >>>>    d'Auvergne, E. J. and Gooley, P. R. (2006). Model-free model
>> >>>> elimination: A new step in the model-free dynamic analysis of NMR
>> >>>> relaxation data. J. Biomol. NMR, 35(2), 117-135.
>> >>>> (http://dx.doi.org/10.1007/s10858-006-9007-z)
>> >>>>
>> >>>> That might indicate if something is wrong - i.e. if the optimisation
>> >>>> of certain simulations has failed.  However this problem only causes
>> >>>> errors to be bigger than they should be (unless all simulations have
>> >>>> failed).  I don't know how Monte Carlo simulations could fail
>> >>>> otherwise.  Monte Carlo simulations are the gold standard for error
>> >>>> analysis.  All other error analysis techniques are judged based on
>> >>>> how closely they approach this gold standard.  Saying that the Monte
>> >>>> Carlo simulations technique failed is about equivalent to claiming
>> >>>> the Earth
>> >>>> is flat!  I challenge you to test the statement on a statistics
>> >>>> professor at your Uni ;)  Anyway, if Monte Carlo failed, using
>> >>>> residuals will not save you as the failure point will be present in
>> >>>> both techniques.  What could have failed is the model or the input
>> >>>> data.  Under-fitting due to too much R2eff data variability in the
>> >>>> spins of the cluster would be my guess.  Do you see similarly small
>> >>>> errors in the non-clustered analysis of the same data?
>> >>>>
>> >>>> Regards,
>> >>>>
>> >>>> Edward
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 16 January 2015 at 17:48, Troels Emtekær Linnet
>> >>>> <tlin...@nmr-relax.com> wrote:
>> >>>> > Hi Edward.
>> >>>> >
>> >>>> > At the moment, I am fairly confident that I should investigate the
>> >>>> > distribution from which the errors are drawn.
>> >>>> >
>> >>>> > The method in relax draws from a Gaussian distribution of the R2eff
>> >>>> > errors, but I should try to draw errors from the overall residual
>> >>>> > instead.
>> >>>> >
>> >>>> > It is two different methods.
>> >>>> >
>> >>>> > My PI has previously analysed the data with the aforementioned
>> >>>> > method, and got errors in the hundreds.
>> >>>> > Errors are 5-10% of the fitted global parameters.
>> >>>> >
>> >>>> > Having 0.5-1 percent errors is way too small, and I see this for 4
>> >>>> > of my datasets.
>> >>>> >
>> >>>> > So, something is fishy.
>> >>>> >
>> >>>> > Best
>> >>>> > Troels
>> >>>> >
>> >>>> > 2015-01-16 17:30 GMT+01:00 Edward d'Auvergne
>> >>>> > <edw...@nmr-relax.com>:
>> >>>> >>
>> >>>> >> Hi Troels,
>> >>>> >>
>> >>>> >> You should be very careful with your interpretation here.  The
>> >>>> >> curvature of the chi-squared space does not correlate with the
>> >>>> >> parameter errors!  Well, in most cases it doesn't.  You will see
>> >>>> >> this
>> >>>> >> if you map the space for different Monte Carlo simulations.  Some
>> >>>> >> extreme edge cases might help in understanding the problem.  Let's
>> >>>> >> say
>> >>>> >> you have a kex value of 100 with a real error of 1000.  In this
>> >>>> >> case,
>> >>>> >> you could still have a small, perfectly quadratic minimum.  But
>> >>>> >> this
>> >>>> >> minimum will jump all over the place with the simulations.
>> >>>> >> Another
>> >>>> >> extreme example might be kex of 100 with a real error of
>> >>>> >> 0.00000001.
>> >>>> >> In this case, the chi-squared space could look similar to the
>> >>>> >> screenshot you attached to the task (http://gna.org/task/?7882).
>> >>>> >> However Monte Carlo simulations may hardly perturb the chi-squared
>> >>>> >> space.  I have observed scenarios similar to these hypothetical
>> >>>> >> cases
>> >>>> >> with the Lipari and Szabo model-free protein dynamics analysis.
>> >>>> >>
>> >>>> >> There is one case where the chi-squared space and error space
>> >>>> >> match,
>> >>>> >> and that is at the limit of the minimum when the chi-squared space
>> >>>> >> becomes quadratic.  This happens when you zoom right into the
>> >>>> >> minimum.
>> >>>> >> The correlation matrix approach makes this assumption.  Monte
>> >>>> >> Carlo
>> >>>> >> simulations do not.  In fact, Monte Carlo simulations are the gold
>> >>>> >> standard.  There is no technique which is better than Monte Carlo
>> >>>> >> simulations, if you use enough simulations.  You can only match it
>> >>>> >> by
>> >>>> >> deriving exact symbolic error equations.
>> >>>> >>
>> >>>> >> Therefore you really should investigate how your optimisation
>> >>>> >> space is
>> >>>> >> perturbed by Monte Carlo simulations to understand the correlation
>> >>>> >> -
>> >>>> >> or non-correlation - of the chi-squared curvature and the
>> >>>> >> parameter
>> >>>> >> errors.  Try mapping the minimum for the simulations and see if
>> >>>> >> the
>> >>>> >> distribution of minima matches the chi-squared curvature
>> >>>> >> (http://gna.org/task/download.php?file_id=23527).
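>> >>>> >>
>> >>>> >> As a toy illustration of this kind of check (a trivial straight-line
>> >>>> >> model with one parameter, not a dispersion model):
>> >>>> >>
>> >>>> >> import numpy as np
>> >>>> >>
>> >>>> >> rng = np.random.default_rng(0)
>> >>>> >>
>> >>>> >> # Toy model y = k*x with constant per-point errors.
>> >>>> >> x = np.linspace(1.0, 10.0, 10)
>> >>>> >> sigma = 0.5
>> >>>> >> y = 2.0 * x + rng.normal(0.0, sigma, x.size)
>> >>>> >>
>> >>>> >> # The chi-squared minimum has a closed form for this linear model.
>> >>>> >> def fit_k(y_obs):
>> >>>> >>     return np.sum(x * y_obs) / np.sum(x**2)
>> >>>> >>
>> >>>> >> k_hat = fit_k(y)
>> >>>> >>
>> >>>> >> # Error estimate from the quadratic curvature of chi2 at the minimum.
>> >>>> >> curvature_sd = sigma / np.sqrt(np.sum(x**2))
>> >>>> >>
>> >>>> >> # Error from Monte Carlo simulations: back-calculate, add noise,
>> >>>> >> # re-minimise, and look at the spread of the minima.
>> >>>> >> k_sims = [fit_k(k_hat * x + rng.normal(0.0, sigma, x.size))
>> >>>> >>           for _ in range(500)]
>> >>>> >> mc_sd = np.std(k_sims, ddof=1)
>> >>>> >>
>> >>>> >> # For this linear model the two agree; for non-linear dispersion
>> >>>> >> # models the curvature at the minimum need not reflect the spread
>> >>>> >> # of the simulated minima.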
>> >>>> >>
>> >>>> >> Regards,
>> >>>> >>
>> >>>> >> Edward
>> >>>> >>
>> >>>> >>
>> >>>> >> On 16 January 2015 at 17:14, Troels E. Linnet
>> >>>> >> <no-reply.invalid-addr...@gna.org> wrote:
>> >>>> >> > URL:
>> >>>> >> >   <http://gna.org/task/?7882>
>> >>>> >> >
>> >>>> >> >                  Summary: Implement Monte-Carlo simulation, where
>> >>>> >> > errors are generated with width of standard deviation or residuals
>> >>>> >> >                  Project: relax
>> >>>> >> >             Submitted by: tlinnet
>> >>>> >> >             Submitted on: Fri 16 Jan 2015 04:14:30 PM UTC
>> >>>> >> >          Should Start On: Fri 16 Jan 2015 12:00:00 AM UTC
>> >>>> >> >    Should be Finished on: Fri 16 Jan 2015 12:00:00 AM UTC
>> >>>> >> >                 Category: relax's source code
>> >>>> >> >                 Priority: 5 - Normal
>> >>>> >> >                   Status: In Progress
>> >>>> >> >         Percent Complete: 0%
>> >>>> >> >              Assigned to: tlinnet
>> >>>> >> >              Open/Closed: Open
>> >>>> >> >          Discussion Lock: Any
>> >>>> >> >                   Effort: 0.00
>> >>>> >> >
>> >>>> >> >     _______________________________________________________
>> >>>> >> >
>> >>>> >> > Details:
>> >>>> >> >
>> >>>> >> > This is implemented due to strange results.
>> >>>> >> >
>> >>>> >> > A relaxation dispersion analysis on data with 61 spins, and a
>> >>>> >> > Monte Carlo simulation with 500 steps, showed unexpectedly low
>> >>>> >> > errors.
>> >>>> >> >
>> >>>> >> > -------
>> >>>> >> > results.read(file=fname_results, dir=dir_results)
>> >>>> >> >
>> >>>> >> > # Number of MC
>> >>>> >> > mc_nr = 500
>> >>>> >> >
>> >>>> >> > monte_carlo.setup(number=mc_nr)
>> >>>> >> > monte_carlo.create_data()
>> >>>> >> > monte_carlo.initial_values()
>> >>>> >> > minimise.execute(min_algor='simplex', func_tol=1e-25,
>> >>>> >> > max_iter=int(1e7),
>> >>>> >> > constraints=True)
>> >>>> >> > monte_carlo.error_analysis()
>> >>>> >> > --------
>> >>>> >> >
>> >>>> >> > The kex value was 2111 with an error of 16.6.
>> >>>> >> >
>> >>>> >> > When performing a dx.map, some weird results were found:
>> >>>> >> >
>> >>>> >> > i_sort    dw_sort    pA_sort    kex_sort      chi2_sort
>> >>>> >> > 471       4.50000    0.99375    2125.00000    4664.31083
>> >>>> >> > 470       4.50000    0.99375    1750.00000    4665.23872
>> >>>> >> >
>> >>>> >> > So even a small change in chi2 corresponds to a large deviation
>> >>>> >> > in kex.
>> >>>> >> >
>> >>>> >> > It seems that changing the R2eff values according to their errors
>> >>>> >> > is not "enough".
>> >>>> >> >
>> >>>> >> > According to the regression book of Graphpad
>> >>>> >> > http://www.graphpad.com/faq/file/Prism4RegressionBook.pdf
>> >>>> >> >
>> >>>> >> > Pages 33 and 104.
>> >>>> >> > The standard deviation of the residuals is:
>> >>>> >> >
>> >>>> >> > Sxy = sqrt(SS/(N-p))
>> >>>> >> >
>> >>>> >> > where SS is the sum of squares, and N - p is the number of degrees
>> >>>> >> > of freedom.  In relax, SS is spin.chi2, and is weighted.
>> >>>> >> >
>> >>>> >> > The random scatter added to each R2eff point should be drawn from
>> >>>> >> > a Gaussian distribution with a mean of zero and an SD equal to Sxy.
>> >>>> >> >
>> >>>> >> > Additionally, find the 2.5 and 97.5 percentiles for each parameter.
>> >>>> >> > The range between these values is the confidence interval.
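>> >>>> >> >
>> >>>> >> > A rough numpy sketch of this procedure (hypothetical values, not
>> >>>> >> > the relax implementation):
>> >>>> >> >
>> >>>> >> > import numpy as np
>> >>>> >> >
>> >>>> >> > rng = np.random.default_rng(0)
>> >>>> >> >
>> >>>> >> > # Hypothetical back-calculated R2eff curve and fit statistics.
>> >>>> >> > r2eff_calc = np.array([25.0, 22.0, 20.5, 19.8, 19.5])
>> >>>> >> > ss = 4.5                     # weighted sum of squares (spin.chi2)
>> >>>> >> > dof = len(r2eff_calc) - 3    # N - p
>> >>>> >> > sxy = np.sqrt(ss / dof)
>> >>>> >> >
>> >>>> >> > # Add Gaussian scatter with mean zero and SD = Sxy to every point
>> >>>> >> > # of each of the 500 simulated data sets, each then re-fitted.
>> >>>> >> > sims = r2eff_calc + rng.normal(0.0, sxy, size=(500, r2eff_calc.size))
>> >>>> >> >
>> >>>> >> > # After re-fitting, the 2.5 and 97.5 percentiles of each parameter
>> >>>> >> > # across the simulations give its confidence interval, e.g. for kex:
>> >>>> >> > # ci_low, ci_high = np.percentile(kex_sims, [2.5, 97.5])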
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >     _______________________________________________________
>> >>>> >> >
>> >>>> >> > File Attachments:
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > -------------------------------------------------------
>> >>>> >> > Date: Fri 16 Jan 2015 04:14:30 PM UTC  Name: Screenshot-1.png
>> >>>> >> > Size:
>> >>>> >> > 161kB
>> >>>> >> > By: tlinnet
>> >>>> >> >
>> >>>> >> > <http://gna.org/task/download.php?file_id=23527>
>> >>>> >> >
>> >>>> >> >     _______________________________________________________
>> >>>> >> >
>> >>>> >> > Reply to this item at:
>> >>>> >> >
>> >>>> >> >   <http://gna.org/task/?7882>
>> >>>> >> >
>> >>>> >> > _______________________________________________
>> >>>> >> >   Message sent via/by Gna!
>> >>>> >> >   http://gna.org/
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > _______________________________________________
>> >>>> >> > relax (http://www.nmr-relax.com)
>> >>>> >> >
>> >>>> >> > This is the relax-devel mailing list
>> >>>> >> > relax-devel@gna.org
>> >>>> >> >
>> >>>> >> > To unsubscribe from this list, get a password
>> >>>> >> > reminder, or change your subscription options,
>> >>>> >> > visit the list information page at
>> >>>> >> > https://mail.gna.org/listinfo/relax-devel
>> >>>> >
>> >>>> >
>> >>>
>> >>>
>
>
