#8096: Speed up parent creation for multiplication of square matrices
------------------------------+---------------------------------------------
   Reporter:  boothby         |          Owner:  boothby          
       Type:  enhancement     |         Status:  needs_review     
   Priority:  minor           |      Milestone:  sage-5.0         
  Component:  linear algebra  |       Keywords:                   
Work_issues:  rebase          |       Upstream:  N/A              
   Reviewer:                  |         Author:  boothby, robertwb
     Merged:                  |   Dependencies:                   
------------------------------+---------------------------------------------
Changes (by SimonKing):

  * status:  needs_work => needs_review


Old description:

> Multiplication of small square matrices is ridiculously slow:
> {{{
> sage: for d in range(1, 100):
> ...    print d
> ...    A = random_matrix(GF(3), d)
> ...    B = random_matrix(GF(3), d)
> ...    timeit("C = A*B")
>

> 1
> 625 loops, best of 3: 93.8 µs per loop
> 2
> 625 loops, best of 3: 93.9 µs per loop
> 3
> 625 loops, best of 3: 94.2 µs per loop
> 4
> 625 loops, best of 3: 94.1 µs per loop
> 5
> 625 loops, best of 3: 94.7 µs per loop
> 6
> 625 loops, best of 3: 94.9 µs per loop
> 7
> 625 loops, best of 3: 95.2 µs per loop
> 8
> 625 loops, best of 3: 95.8 µs per loop
> 9
> 625 loops, best of 3: 96.8 µs per loop
> 10
> 625 loops, best of 3: 97.6 µs per loop
> 11
> 625 loops, best of 3: 98.1 µs per loop
> 12
> 625 loops, best of 3: 101 µs per loop
> 13
> 625 loops, best of 3: 101 µs per loop
> 14
> 625 loops, best of 3: 104 µs per loop
> 15
> 625 loops, best of 3: 104 µs per loop
> 16
> 625 loops, best of 3: 108 µs per loop
> 17
> 625 loops, best of 3: 108 µs per loop
> 18
> 625 loops, best of 3: 113 µs per loop
> 19
> 625 loops, best of 3: 112 µs per loop
> 20
> 625 loops, best of 3: 118 µs per loop
> 21
> 625 loops, best of 3: 117 µs per loop
> 22
> 625 loops, best of 3: 125 µs per loop
> 23
> 625 loops, best of 3: 123 µs per loop
> 24
> 625 loops, best of 3: 133 µs per loop
> 25
> 625 loops, best of 3: 129 µs per loop
> 26
> 625 loops, best of 3: 143 µs per loop
> 27
> 625 loops, best of 3: 137 µs per loop
> 28
> 625 loops, best of 3: 155 µs per loop
> 29
> 625 loops, best of 3: 147 µs per loop
> 30
> 625 loops, best of 3: 166 µs per loop
> 31
> 625 loops, best of 3: 157 µs per loop
> 32
> 625 loops, best of 3: 179 µs per loop
> 33
> 625 loops, best of 3: 168 µs per loop
> 34
> 625 loops, best of 3: 196 µs per loop
> 35
> 625 loops, best of 3: 182 µs per loop
> 36
> 625 loops, best of 3: 214 µs per loop
> 37
> 625 loops, best of 3: 198 µs per loop
> 38
> 625 loops, best of 3: 234 µs per loop
> 39
> 625 loops, best of 3: 213 µs per loop
> 40
> 625 loops, best of 3: 255 µs per loop
> 41
> 625 loops, best of 3: 231 µs per loop
> 42
> 625 loops, best of 3: 279 µs per loop
> 43
> 625 loops, best of 3: 251 µs per loop
> 44
> 625 loops, best of 3: 307 µs per loop
> 45
> 625 loops, best of 3: 271 µs per loop
> 46
> 625 loops, best of 3: 335 µs per loop
> 47
> 625 loops, best of 3: 298 µs per loop
> 48
> 625 loops, best of 3: 363 µs per loop
> 49
> 625 loops, best of 3: 319 µs per loop
> 50
> 625 loops, best of 3: 401 µs per loop
> 51
> 625 loops, best of 3: 346 µs per loop
> }}}
>
> Here's a profile of the 1x1 case:
>
> {{{
> 625 loops, best of 3: 91.7 µs per loop
>          810004 function calls in 5.980 CPU seconds
>
>    Ordered by: standard name
>
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>     40000    0.100    0.000    0.100    0.000 :0(IntegerMod)
>     30000    0.080    0.000    0.080    0.000 :0(append)
>     10000    0.030    0.000    0.030    0.000 :0(base_ring)
>     10000    0.150    0.000    0.900    0.000 :0(category)
>     40000    0.250    0.000    0.290    0.000 :0(has_key)
>     10000    0.070    0.000    0.070    0.000 :0(hasattr)
>     10000    0.030    0.000    0.030    0.000 :0(is_Matrix)
>     80000    0.250    0.000    0.250    0.000 :0(isinstance)
>     30000    0.070    0.000    0.070    0.000 :0(keys)
>     30000    0.040    0.000    0.040    0.000 :0(len)
>     30001    0.080    0.000    0.080    0.000 :0(range)
>     10000    0.030    0.000    0.030    0.000 :0(setdefault)
>         1    0.000    0.000    0.000    0.000 :0(setprofile)
>     30000    0.140    0.000    0.140    0.000 :0(sorted)
>         1    0.380    0.380    5.980    5.980 <string>:1(<module>)
>     30000    0.250    0.000    1.750    0.000 cachefunc.py:155(get_key)
>     10000    0.060    0.000    0.850    0.000 cachefunc.py:220(__call__)
>     10000    0.060    0.000    0.090    0.000 cachefunc.py:254(get_cache)
>     10000    0.050    0.000    0.050    0.000 cachefunc.py:275(__get__)
>     20000    0.270    0.000    1.550    0.000 cachefunc.py:76(__call__)
>     20000    0.060    0.000    0.060    0.000 cachefunc.py:95(get_cache)
>     10000    0.070    0.000    2.520    0.000
> category.py:459(__contains__)
>     10000    0.360    0.000    1.550    0.000
> category.py:651(is_subcategory)
>     20000    0.120    0.000    1.670    0.000
> classcall_metaclass.py:64(__call__)
>     20000    0.040    0.000    0.040    0.000
> finite_field_prime_modn.py:121(characteristic)
>     20000    0.110    0.000    0.110    0.000
> finite_field_prime_modn.py:187(order)
>     30000    1.010    0.000    1.500    0.000
> function_mangling.py:205(fix_to_pos)
>     30000    0.080    0.000    0.080    0.000
> function_mangling.py:261(<genexpr>)
>     40000    0.290    0.000    0.390    0.000
> integer_mod_ring.py:733(__call__)
>     10000    0.050    0.000    0.070    0.000
> matrix_group_element.py:68(is_MatrixGroupElement)
>     10000    0.450    0.000    0.710    0.000
> matrix_space.py:1035(matrix)
>     10000    0.140    0.000    4.000    0.000
> matrix_space.py:1089(matrix_space)
>     10000    0.200    0.000    3.830    0.000
> matrix_space.py:110(MatrixSpace)
>     10000    0.000    0.000    0.000    0.000 matrix_space.py:1112(ncols)
>     10000    0.030    0.000    0.030    0.000 matrix_space.py:1124(nrows)
>     10000    0.310    0.000    1.270    0.000
> matrix_space.py:271(__call__)
>     10000    0.000    0.000    0.000    0.000 misc.py:514(get_verbose)
>         1    0.000    0.000    5.980    5.980 profile:0(for i in
> range(10000): C = A*B)
>         0    0.000             0.000          profile:0(profiler)
>     90000    0.270    0.000    0.270    0.000
> unique_representation.py:454(__eq__)
> }}}

New description:

 Multiplication of small square matrices is ridiculously slow:
 {{{
 sage: for d in range(1, 100):
 ...    print d
 ...    A = random_matrix(GF(3), d)
 ...    B = random_matrix(GF(3), d)
 ...    timeit("C = A*B")


 1
 625 loops, best of 3: 93.8 µs per loop
 2
 625 loops, best of 3: 93.9 µs per loop
 3
 625 loops, best of 3: 94.2 µs per loop
 4
 625 loops, best of 3: 94.1 µs per loop
 5
 625 loops, best of 3: 94.7 µs per loop
 6
 625 loops, best of 3: 94.9 µs per loop
 7
 625 loops, best of 3: 95.2 µs per loop
 8
 625 loops, best of 3: 95.8 µs per loop
 9
 625 loops, best of 3: 96.8 µs per loop
 10
 625 loops, best of 3: 97.6 µs per loop
 11
 625 loops, best of 3: 98.1 µs per loop
 12
 625 loops, best of 3: 101 µs per loop
 13
 625 loops, best of 3: 101 µs per loop
 14
 625 loops, best of 3: 104 µs per loop
 15
 625 loops, best of 3: 104 µs per loop
 16
 625 loops, best of 3: 108 µs per loop
 17
 625 loops, best of 3: 108 µs per loop
 18
 625 loops, best of 3: 113 µs per loop
 19
 625 loops, best of 3: 112 µs per loop
 20
 625 loops, best of 3: 118 µs per loop
 21
 625 loops, best of 3: 117 µs per loop
 22
 625 loops, best of 3: 125 µs per loop
 23
 625 loops, best of 3: 123 µs per loop
 24
 625 loops, best of 3: 133 µs per loop
 25
 625 loops, best of 3: 129 µs per loop
 26
 625 loops, best of 3: 143 µs per loop
 27
 625 loops, best of 3: 137 µs per loop
 28
 625 loops, best of 3: 155 µs per loop
 29
 625 loops, best of 3: 147 µs per loop
 30
 625 loops, best of 3: 166 µs per loop
 31
 625 loops, best of 3: 157 µs per loop
 32
 625 loops, best of 3: 179 µs per loop
 33
 625 loops, best of 3: 168 µs per loop
 34
 625 loops, best of 3: 196 µs per loop
 35
 625 loops, best of 3: 182 µs per loop
 36
 625 loops, best of 3: 214 µs per loop
 37
 625 loops, best of 3: 198 µs per loop
 38
 625 loops, best of 3: 234 µs per loop
 39
 625 loops, best of 3: 213 µs per loop
 40
 625 loops, best of 3: 255 µs per loop
 41
 625 loops, best of 3: 231 µs per loop
 42
 625 loops, best of 3: 279 µs per loop
 43
 625 loops, best of 3: 251 µs per loop
 44
 625 loops, best of 3: 307 µs per loop
 45
 625 loops, best of 3: 271 µs per loop
 46
 625 loops, best of 3: 335 µs per loop
 47
 625 loops, best of 3: 298 µs per loop
 48
 625 loops, best of 3: 363 µs per loop
 49
 625 loops, best of 3: 319 µs per loop
 50
 625 loops, best of 3: 401 µs per loop
 51
 625 loops, best of 3: 346 µs per loop
 }}}

 Here's a profile of the 1x1 case:

 {{{
 625 loops, best of 3: 91.7 µs per loop
          810004 function calls in 5.980 CPU seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     40000    0.100    0.000    0.100    0.000 :0(IntegerMod)
     30000    0.080    0.000    0.080    0.000 :0(append)
     10000    0.030    0.000    0.030    0.000 :0(base_ring)
     10000    0.150    0.000    0.900    0.000 :0(category)
     40000    0.250    0.000    0.290    0.000 :0(has_key)
     10000    0.070    0.000    0.070    0.000 :0(hasattr)
     10000    0.030    0.000    0.030    0.000 :0(is_Matrix)
     80000    0.250    0.000    0.250    0.000 :0(isinstance)
     30000    0.070    0.000    0.070    0.000 :0(keys)
     30000    0.040    0.000    0.040    0.000 :0(len)
     30001    0.080    0.000    0.080    0.000 :0(range)
     10000    0.030    0.000    0.030    0.000 :0(setdefault)
         1    0.000    0.000    0.000    0.000 :0(setprofile)
     30000    0.140    0.000    0.140    0.000 :0(sorted)
         1    0.380    0.380    5.980    5.980 <string>:1(<module>)
     30000    0.250    0.000    1.750    0.000 cachefunc.py:155(get_key)
     10000    0.060    0.000    0.850    0.000 cachefunc.py:220(__call__)
     10000    0.060    0.000    0.090    0.000 cachefunc.py:254(get_cache)
     10000    0.050    0.000    0.050    0.000 cachefunc.py:275(__get__)
     20000    0.270    0.000    1.550    0.000 cachefunc.py:76(__call__)
     20000    0.060    0.000    0.060    0.000 cachefunc.py:95(get_cache)
     10000    0.070    0.000    2.520    0.000
 category.py:459(__contains__)
     10000    0.360    0.000    1.550    0.000
 category.py:651(is_subcategory)
     20000    0.120    0.000    1.670    0.000
 classcall_metaclass.py:64(__call__)
     20000    0.040    0.000    0.040    0.000
 finite_field_prime_modn.py:121(characteristic)
     20000    0.110    0.000    0.110    0.000
 finite_field_prime_modn.py:187(order)
     30000    1.010    0.000    1.500    0.000
 function_mangling.py:205(fix_to_pos)
     30000    0.080    0.000    0.080    0.000
 function_mangling.py:261(<genexpr>)
     40000    0.290    0.000    0.390    0.000
 integer_mod_ring.py:733(__call__)
     10000    0.050    0.000    0.070    0.000
 matrix_group_element.py:68(is_MatrixGroupElement)
     10000    0.450    0.000    0.710    0.000 matrix_space.py:1035(matrix)
     10000    0.140    0.000    4.000    0.000
 matrix_space.py:1089(matrix_space)
     10000    0.200    0.000    3.830    0.000
 matrix_space.py:110(MatrixSpace)
     10000    0.000    0.000    0.000    0.000 matrix_space.py:1112(ncols)
     10000    0.030    0.000    0.030    0.000 matrix_space.py:1124(nrows)
     10000    0.310    0.000    1.270    0.000
 matrix_space.py:271(__call__)
     10000    0.000    0.000    0.000    0.000 misc.py:514(get_verbose)
         1    0.000    0.000    5.980    5.980 profile:0(for i in
 range(10000): C = A*B)
         0    0.000             0.000          profile:0(profiler)
     90000    0.270    0.000    0.270    0.000
 unique_representation.py:454(__eq__)
 }}}

 Apply [attachment:trac8096_speedup_matrix_parent.patch]

--

Comment:

 I have rebased the patches, combining them into one patch. In fact, a part
 (but not all) of the second patch is not needed anymore, since the
 creation of the zero matrix has been improved in another ticket.

 Here is evidence that the patch still does improve the timings for the
 computation of small matrices - even after we have switched to linbox:

 Without patch:
 {{{
 sage: d = 1
 sage: A = random_matrix(GF(3), d)
 sage: B = random_matrix(GF(3), d)
 sage: timeit("C = A*B")
 625 loops, best of 3: 40.5 µs per loop
 }}}

 With the patch:
 {{{
 sage: d = 1
 sage: A = random_matrix(GF(3), d)
 sage: B = random_matrix(GF(3), d)
 sage: timeit("C = A*B")
 625 loops, best of 3: 18.5 µs per loop
 }}}

 I need to run the doctests, though. But needs review.

 Apply trac8096_speedup_matrix_parent.patch

-- 
Ticket URL: <http://trac.sagemath.org/sage_trac/ticket/8096#comment:40>
Sage <http://www.sagemath.org>
Sage: Creating a Viable Open Source Alternative to Magma, Maple, Mathematica, 
and MATLAB

-- 
You received this message because you are subscribed to the Google Groups 
"sage-trac" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/sage-trac?hl=en.

Reply via email to