Here is a hint for the ultimate speed-up of the relaxation dispersion
analysis.  This will not work for all dispersion models though,
especially the numeric models.  The idea is to eliminate all looping
in the target functions and lib.dispersion modules, specifically:

# Loop over the experiment types.
for ei in range(self.num_exp):
    # Loop over the spins.
    for si in range(self.num_spins):
        # Loop over the spectrometer frequencies.
        for mi in range(self.num_frq):
            # Loop over the offsets.
            for oi in range(self.num_offsets[ei][si][mi]):
                # Loop over the dispersion points
                for di in range(self.num_disp_points[ei][si][mi][oi]):
                    pass

This is possible, but it will require that most of the target function
data structures become much, much bigger, i.e. converting almost all of
them to have the dimensions [ei][si][mi][oi][di], in the current
dispersion notation.  It also requires that the lib.dispersion.*
modules all calculate the entire "self.back_calc[ei][si][mi][oi][di]"
data structure in a single function call.  The key is to design the
numpy arrays that are sent into the lib.dispersion.* modules so that
they can be operated on as whole arrays; the mathematical expressions
in the lib.dispersion code would then not need to change.  This
includes everything passed into the dispersion target function class
and created in its __init__() method.  Each target function would also
have to convert the R20, dw, and phi_ex parameters into large numpy
data structures, again with [ei][si][mi][oi][di] dimensions.
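
For illustration, here is a minimal sketch of such a parameter
expansion, assuming rectangular data (the ragged offset and dispersion
point axes padded out to their maximum sizes) and using purely
hypothetical names and dimension sizes:

import numpy as np

# Hypothetical maximum sizes for the five dispersion dimensions:
# experiment types, spins, spectrometer frequencies, offsets, points.
SHAPE = (2, 5, 3, 1, 20)

def expand_dw(dw_per_spin, shape=SHAPE):
    """Broadcast per-spin dw values up to the full 5D structure.

    The 1D dw_per_spin array of length shape[1] is reshaped to
    (1, N, 1, 1, 1) so that numpy broadcasting copies each value
    along the ei, mi, oi and di dimensions without any Python loops.
    """
    return dw_per_spin.reshape(1, -1, 1, 1, 1) * np.ones(shape, np.float64)

# For example, expanding hypothetical dw values for five spins:
# dw_5d = expand_dw(np.array([0.5, 1.0, 1.5, 2.0, 2.5]))

A lib.dispersion.* module receiving such pre-shaped arrays could then
compute the whole back_calc structure with whole-array numpy
operations in a single call.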

If such a change were to be made, the speed-ups would be huge!  This
would be the maximum speed-up possible, as all Python loops would be
eliminated and everything would be done in numpy.  This would then be
close to the speed that you would expect if you rewrote the target
functions in either C or FORTRAN combined with the super fast BLAS and
LAPACK libraries (which are used by numpy).

Regards,

Edward


P.S.  This would also require the implementation of the 'missing'
data structure change and chi2_5D() function ideas of the parent post
(http://thread.gmane.org/gmane.science.nmr.relax.devel/5726).

On 9 May 2014 16:29, Edward d'Auvergne <edw...@nmr-relax.com> wrote:
> Hi,
>
> This post is a placeholder for ideas about how to optimise the
> handling of the self.missing data structure in the
> target_functions.relax_disp module.  The current implementation is
> very slow and has a significant impact on the speed of a dispersion
> analysis.
>
> The current logic is:
>
> # Loop over the experiment types.
> for ei in range(self.num_exp):
>     # Loop over the spins.
>     for si in range(self.num_spins):
>         # Loop over the spectrometer frequencies.
>         for mi in range(self.num_frq):
>             # Loop over the offsets.
>             for oi in range(self.num_offsets[ei][si][mi]):
>                 # For all missing data points, set the back-calculated
>                 # value to the measured value so that it has no effect
>                 # on the chi-squared value.
>                 for di in range(self.num_disp_points[ei][si][mi][oi]):
>                     if self.missing[ei][si][mi][oi][di]:
>                         self.back_calc[ei][si][mi][oi][di] = self.values[ei][si][mi][oi][di]
>
>                 # Calculate the chi-squared value.
>                 chi2_sum += chi2(self.values[ei][si][mi][oi], self.back_calc[ei][si][mi][oi], self.errors[ei][si][mi][oi])
>
>
> Here, the back calculated value is set to the real value when data is
> missing.  That way the chi-squared value for this element will be zero
> as the difference between the back-calculated value (set to the real
> value) and the real value is zero.
>
> The ideal would be to shift this logic outside of all of these
> loops, and then to construct a chi2_multi_dim() function, replacing
> chi2(), that calculates the chi-squared value for all elements as one
> huge numpy data structure and then sums over all dimensions.  Here
> for the dispersion analysis, the data structures have 5 dimensions.
> Probably for maximum speed, a series of functions would be needed -
> chi2_2D(), chi2_3D(), chi2_4D() and chi2_5D().
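>
> As a minimal sketch of the 5D variant, assuming all three arguments
> are numpy float64 arrays of identical [ei][si][mi][oi][di] shape and
> that the errors contain no zeros (missing points would hold
> placeholder errors):
>
> import numpy as np
>
> def chi2_5D(data, back_calc, errors):
>     """Calculate the chi-squared value over all five dimensions.
>
>     The element-wise operations and the final sum run in compiled
>     numpy code, replacing the nested Python loops and the
>     individual chi2() calls of the current implementation.
>     """
>     return np.sum(((data - back_calc) / errors)**2)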
>
> For the handling of the missing data, the following could be
> implemented.  The logic will be the same: the elements of
> self.back_calc which correspond to missing data would be set to the
> real values (self.values).  How can this be done outside of the
> loops?  Simply:
>
>         back_calc_new = self.back_calc * self.missing_inv + self.missing_values
>
> For this to work, self.missing_inv and self.missing_values would need
> to be pre-created in the __init__() method as numpy float64 arrays:
>
> - self.missing_inv:  An element is zero when data is missing, and 1.0
> otherwise.  The multiplication will cause all missing elements in
> self.back_calc to be set to zero while all other values remain
> unmodified.
>
> - self.missing_values:  This will be an array full of zeros, except
> where data is missing, in which case the element is set to the
> corresponding number in self.values.
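>
> As a sketch of the set-up, assuming self.missing is itself stored as
> a 5D numpy float64 array with 1.0 flagging the missing elements (the
> helper name here is hypothetical):
>
> def _init_missing_masks(self):
>     """Pre-create the masking arrays in __init__().
>
>     Assumes self.missing and self.values are numpy float64 arrays
>     with the same [ei][si][mi][oi][di] dimensions.
>     """
>     # 1.0 where data is present, 0.0 where it is missing, so that
>     # the multiplication zeroes the missing back-calculated elements.
>     self.missing_inv = 1.0 - self.missing
>
>     # The measured values at the missing elements, zeros elsewhere,
>     # so that the addition restores them after the masking.
>     self.missing_values = self.values * self.missing
>
> The single multiply-and-add line above would then replace the entire
> di loop of the current implementation.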
>
> The last two lines of the target function would then be:
>
>         # Return the total chi-squared value.
>         return chi2_5D(self.values, back_calc_new, self.errors)
>
> The speed-up of such an implementation would be very significant.  A
> halving of the calculation time would not be surprising, though that
> would depend on the dispersion model.  Anyway, I will leave such an
> implementation for anyone who is interested in speed.  Please discuss
> first.
>
> Regards,
>
> Edward
