Re: Speed up suggestion for task #7807.

Troels Emtekær Linnet Wed, 11 Jun 2014 02:56:28 -0700

Hi Edward.

Some timings.
Per spin, you have a faster method.
But I win per cluster.


1000 iterations
1 / 100 spins

Edward
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.523    0.523 <string>:1(<module>)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.875    3.875 <string>:1(<module>)

Troels Tile
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.563    0.563 <string>:1(<module>)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.102    2.102 <string>:1(<module>)

Troels Outer
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.546    0.546 <string>:1(<module>)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.974    1.974 <string>:1(<module>)

2014-06-11 11:46 GMT+02:00 Troels Emtekær Linnet <[email protected]>:
> Hi Edward.
>
> This is a really god page!
> http://docs.scipy.org/doc/numpy/reference/ufuncs.html
>
> ""
> Tip
> The optional output arguments can be used to help you save memory for
> large calculations. If your arrays are large, complicated expressions
> can take longer than absolutely necessary due to the creation and
> (later) destruction of temporary calculation spaces. For example, the
> expression G = a * b + c is equivalent to t1 = A * B; G = T1 + C; del
> t1. It will be more quickly executed as G = A * B; add(G, C, G) which
> is the same as G = A * B; G += C.
> ""
>
> 2014-06-10 23:08 GMT+02:00 Edward d'Auvergne <[email protected]>:
>> Note that masks and numpy.ma.multiply() and numpy.ma.add() may speed
>> this up even more.  However due to overheads in the numpy masking,
>> there is a chance that this also makes the dw and R20 data structure
>> construction slower.
>>
>> Regards,
>>
>> Edward
>>
>>
>>
>> On 10 June 2014 22:36, Edward d'Auvergne <[email protected]> wrote:
>>> Hi Troels,
>>>
>>> To make things even simpler, here is what needs to be done for R20,
>>> R20A and R20B:
>>>
>>> """
>>> from numpy import abs, add, array, float64, multiply, ones, sum, zeros
>>>
>>> # Init mimic.
>>> #############
>>>
>>> # Values from Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster.
>>> NE = 1
>>> NS = 2
>>> NM = 2
>>> NO = 1
>>> ND = 8
>>> R20A = array([  9.984626320294867,  11.495327724693091,
>>> 12.991028416082928, 14.498419290021163])
>>> shape = (NE, NS, NM, NO, ND)
>>>
>>> # Final structure for lib.dispersion.
>>> R20A_struct = zeros(shape, float64)
>>>
>>> # Temporary storage to avoid memory allocations and garbage collection.
>>> R20A_temp = zeros(shape, float64)
>>>
>>> # The structure for multiplication with R20A to piecewise build up the
>>> full R20A structure.
>>> R20A_mask = zeros((NS*NM,) + shape, float64)
>>> for si in range(NS):
>>>     for mi in range(NM):
>>>         R20A_mask[si*NM+mi, :, si, mi] = 1.0
>>> print(R20A_mask)
>>> print("\n\n")
>>>
>>> # Values to be found (again taken directly from
>>> Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster - as a
>>> printout of dw_frq_a).
>>> R20A_final = array([[[[[  9.984626320294867,   9.984626320294867,
>>> 9.984626320294867,
>>>                           9.984626320294867,   9.984626320294867,
>>> 9.984626320294867,
>>>                           9.984626320294867,   9.984626320294867]],
>>>
>>>                       [[ 11.495327724693091,  11.495327724693091,
>>> 11.495327724693091,
>>>                          11.495327724693091,  11.495327724693091,
>>> 11.495327724693091,
>>>                          11.495327724693091,  11.495327724693091]]],
>>>
>>>
>>>                      [[[ 12.991028416082928,  12.991028416082928,
>>> 12.991028416082928,
>>>                          12.991028416082928,  12.991028416082928,
>>> 12.991028416082928,
>>>                          12.991028416082928,  12.991028416082928]],
>>>
>>>                       [[ 14.498419290021163,  14.498419290021163,
>>> 14.498419290021163,
>>>                          14.498419290021163,  14.498419290021163,
>>> 14.498419290021163,
>>>                          14.498419290021163,  14.498419290021163]]]]])
>>>
>>>
>>> # Target function.
>>> ##################
>>>
>>> # Loop over the R20A elements (one per spin).
>>> for r20_index in range(NS*NM):
>>>     # First multiply the spin specific R20A with the spin specific
>>> frequency mask, using temporary storage.
>>>     multiply(R20A[r20_index], R20A_mask[r20_index], R20A_temp)
>>>
>>>     # The add to the total.
>>>     add(R20A_struct, R20A_temp, R20A_struct)
>>>
>>> # Show that the structure is reproduced perfectly.
>>> print(R20A_struct)
>>> print(R20A_struct - R20A_final)
>>> print(sum(abs(R20A_struct - R20A_final)))
>>> """
>>>
>>>
>>> You may notice one simplification compared to my previous example for
>>> the dw parameter
>>> (http://thread.gmane.org/gmane.science.nmr.relax.devel/6135/focus=6154).
>>> The values here too come from the
>>> Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster system test.
>>>
>>> Regards,
>>>
>>> Edward
>>>
>>>
>>> On 10 June 2014 21:31, Edward d'Auvergne <[email protected]> wrote:
>>>> Hi Troels,
>>>>
>>>> No need for an example.  Here is the code to add to your
>>>> infrastructure which will make the analytic dispersion models insanely
>>>> fast:
>>>>
>>>>
>>>> """
>>>> from numpy import add, array, float64, multiply, ones, zeros
>>>>
>>>> # Init mimic.
>>>> #############
>>>>
>>>> # Values from Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster.
>>>> NE = 1
>>>> NS = 2
>>>> NM = 2
>>>> NO = 1
>>>> ND = 8
>>>> dw = array([ 1.847792726895652,  0.193719379085542])
>>>> frqs = [-382.188861036982701, -318.479128911056137]
>>>> shape = (NE, NS, NM, NO, ND)
>>>>
>>>> # Final structure for lib.dispersion.
>>>> dw_struct = zeros(shape, float64)
>>>>
>>>> # Temporary storage to avoid memory allocations and garbage collection.
>>>> dw_temp = zeros((NS,) + shape, float64)
>>>>
>>>> # The structure for multiplication with dw to piecewise build up the
>>>> full dw structure.
>>>> dw_mask = zeros((NS,) + shape, float64)
>>>> for si in range(NS):
>>>>     for mi in range(NM):
>>>>         dw_mask[si, :, si, mi] = frqs[mi]
>>>> print(dw_mask)
>>>>
>>>> # Values to be found (again taken directly from
>>>> Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster - as a
>>>> printout of dw_frq_a).
>>>> dw_final = array([[[[[-706.205797724669765, -706.205797724669765,
>>>>                       -706.205797724669765, -706.205797724669765,
>>>>                       -706.205797724669765, -706.205797724669765,
>>>>                       -706.205797724669765, -706.205797724669765]],
>>>>
>>>>                     [[-588.483418069912318, -588.483418069912318,
>>>>                       -588.483418069912318, -588.483418069912318,
>>>>                       -588.483418069912318, -588.483418069912318,
>>>>                       -588.483418069912318, -588.483418069912318]]],
>>>>
>>>>
>>>>                    [[[ -74.03738885349469 ,  -74.03738885349469 ,
>>>>                        -74.03738885349469 ,  -74.03738885349469 ,
>>>>                        -74.03738885349469 ,  -74.03738885349469 ,
>>>>                        -74.03738885349469 ,  -74.03738885349469 ]],
>>>>
>>>>                     [[ -61.69557910435401 ,  -61.69557910435401 ,
>>>>                        -61.69557910435401 ,  -61.69557910435401 ,
>>>>                        -61.69557910435401 ,  -61.69557910435401 ,
>>>>                        -61.69557910435401 ,  -61.69557910435401 ]]]]])
>>>>
>>>>
>>>> # Target function.
>>>> ##################
>>>>
>>>> # Loop over the dw elements (one per spin).
>>>> for si in range(NS):
>>>>     # First multiply the spin specific dw with the spin specific
>>>> frequency mask, using temporary storage.
>>>>     multiply(dw[si], dw_mask[si], dw_temp[si])
>>>>
>>>>     # The add to the total.
>>>>     add(dw_struct, dw_temp[si], dw_struct)
>>>>
>>>> # Show that the structure is reproduced perfectly.
>>>> print(dw_struct - dw_final)
>>>> """
>>>>
>>>> As mentioned in the comments, the structures come from the
>>>> Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster.  I just
>>>> added a check of "if len(dw) > 1: asdfasd" to kill the test, and added
>>>> printouts to obtain dw, frq_a, dw_frq_a, etc.  This is exactly the
>>>> implementation I described.  Although there might be an even faster
>>>> way, this will eliminate all numpy array creation and deletion via
>>>> Python garbage collection in the target functions (when used for R20
>>>> as well).
>>>>
>>>> Regards,
>>>>
>>>> Edward
>>>>
>>>> On 10 June 2014 21:09, Edward d'Auvergne <[email protected]> wrote:
>>>>> If you have a really complicated example of your current 'dw_frq_a'
>>>>> data structure for multiple spins and multiple fields, that could help
>>>>> to construct an example.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Edward
>>>>>
>>>>>
>>>>>
>>>>> On 10 June 2014 20:57, Edward d'Auvergne <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'll have a look tomorrow but, as you've probably seen, some of the
>>>>>> fine details such as indices to be used need to be sorted out when
>>>>>> implementing this.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Edward
>>>>>>
>>>>>>
>>>>>> On 10 June 2014 20:49, Troels Emtekær Linnet <[email protected]> 
>>>>>> wrote:
>>>>>>> What ever I do, I cannot get this to work?
>>>>>>>
>>>>>>> Can you show an example ?
>>>>>>>
>>>>>>> 2014-06-10 16:29 GMT+02:00 Edward d'Auvergne <[email protected]>:
>>>>>>>> Here is an example of avoiding automatic numpy data structure creation
>>>>>>>> and then garbage collection:
>>>>>>>>
>>>>>>>> """
>>>>>>>> from numpy import add, ones, zeros
>>>>>>>>
>>>>>>>> a = zeros((5, 4))
>>>>>>>> a[1] = 1
>>>>>>>> a[:,1] = 2
>>>>>>>>
>>>>>>>> b = ones((5, 4))
>>>>>>>>
>>>>>>>> add(a, b, a)
>>>>>>>> print(a)
>>>>>>>> """
>>>>>>>>
>>>>>>>> The result is:
>>>>>>>>
>>>>>>>> [[ 1.  3.  1.  1.]
>>>>>>>>  [ 2.  3.  2.  2.]
>>>>>>>>  [ 1.  3.  1.  1.]
>>>>>>>>  [ 1.  3.  1.  1.]
>>>>>>>>  [ 1.  3.  1.  1.]]
>>>>>>>>
>>>>>>>> The out argument for numpy.add() is used here to operate in a similar
>>>>>>>> way to the Python "+=" operation.  But it avoids the temporary numpy
>>>>>>>> data structures that the Python "+=" operation will create.  This will
>>>>>>>> save a lot of time in the dispersion code.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Edward
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10 June 2014 15:56, Edward d'Auvergne <[email protected]> wrote:
>>>>>>>>> Hi Troels,
>>>>>>>>>
>>>>>>>>> Here is one suggestion, of many that I have, for significantly
>>>>>>>>> improving the speed of the analytic dispersion models in your
>>>>>>>>> 'disp_spin_speed' branch.  The speed ups you have currently achieved
>>>>>>>>> for spin clusters are huge and very impressive.  But now that you have
>>>>>>>>> the infrastructure in place, you can advance this much more!
>>>>>>>>>
>>>>>>>>> The suggestion has to do with the R20, R20A, and R20B numpy data
>>>>>>>>> structures.  They way they are currently handled is relatively
>>>>>>>>> inefficient, in that they are created de novo for each function call.
>>>>>>>>> This means that memory allocation and Python garbage collection
>>>>>>>>> happens for every single function call - something which should be
>>>>>>>>> avoided at almost all costs.
>>>>>>>>>
>>>>>>>>> A better way to do this would be to have a self.R20_struct,
>>>>>>>>> self.R20A_struct, and self.R20B_struct created in __init__(), and then
>>>>>>>>> to pack in the values from the parameter vector into these structures.
>>>>>>>>> You could create a special structure in __init__() for this.  It would
>>>>>>>>> have the dimensions [r20_index][ei][si][mi][oi], where the first
>>>>>>>>> dimension corresponds to the different R20 parameters.  And for each
>>>>>>>>> r20_index element, you would have ones at the [ei][si][mi][oi]
>>>>>>>>> positions where you would like R20 to be, and zeros elsewhere.  The
>>>>>>>>> key is that this is created at the target function start up, and not
>>>>>>>>> for each function call.
>>>>>>>>>
>>>>>>>>> This would be combined with the very powerful 'out' argument set to
>>>>>>>>> self.R20_struct with the numpy.add() and numpy.multiply() functions to
>>>>>>>>> prevent all memory allocations and garbage collection.  Masks could be
>>>>>>>>> used, but I think that that would be much slower than having special
>>>>>>>>> numpy structures with ones where R20 should be and zeros elsewhere.
>>>>>>>>> For just creating these structures, looping over a single r20_index
>>>>>>>>> loop and multiplying by the special [r20_index][ei][si][mi][oi]
>>>>>>>>> one/zero structure and using numpy.add() and numpy.multiply() with out
>>>>>>>>> arguments would be much, much faster than masks or the current
>>>>>>>>> R20_axis logic.  It will also simplify the code.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Edward
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> relax (http://www.nmr-relax.com)
>>>>>>>>
>>>>>>>> This is the relax-devel mailing list
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>> To unsubscribe from this list, get a password
>>>>>>>> reminder, or change your subscription options,
>>>>>>>> visit the list information page at
>>>>>>>> https://mail.gna.org/listinfo/relax-devel

_______________________________________________
relax (http://www.nmr-relax.com)

This is the relax-devel mailing list
[email protected]

To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-devel

Re: Speed up suggestion for task #7807.

Reply via email to