#12103: Use MeatAxe as an optional back end for dense matrices over
`GF(p^n)`, p odd, n>1, `p^n<255`
-------------------------------------+-------------------------------------
       Reporter:  SimonKing          |        Owner:  jason, was
           Type:  defect             |       Status:  needs_work
       Priority:  major              |    Milestone:  sage-6.4
      Component:  packages:          |   Resolution:
  experimental                       |    Merged in:
       Keywords:  linear algebra,    |    Reviewers:
  MeatAxe                            |  Work issues:
        Authors:  Simon King         |       Commit:
Report Upstream:  None of the above  |  c1d5026ade5e1bb454401a34a734c91c15cbf75f
  - read trac for reasoning.         |     Stopgaps:
         Branch:                     |
  u/SimonKing/meataxe                |
   Dependencies:                     |
-------------------------------------+-------------------------------------

Comment (by SimonKing):

 I force-pushed the new version.

 Changes: The bug is fixed, and FILE and PATCHES are removed.

 Now for timings and the quest for a good (default) cutoff.

 When using Winograd-Strassen, one needs to choose a cutoff size:
 divide-and-conquer is applied if and only if all resulting blocks have at
 least that size.
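
 To make that rule concrete, here is a minimal sketch in Sage syntax. It is
 my own illustration, not the code from the branch, and it uses plain
 Strassen (seven products) rather than the Winograd variant: the recursion
 is only entered while every half-sized block still has at least `cutoff`
 rows and columns; otherwise it falls back to the school book product.
 {{{
 #!python
 def strassen_sketch(A, B, cutoff):
     # Illustrative only: square matrices, no alignment constraints.
     n = A.nrows()
     if n % 2 or n // 2 < cutoff:
         return A * B                    # school book fallback
     h = n // 2
     A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
     B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
     M1 = strassen_sketch(A11 + A22, B11 + B22, cutoff)
     M2 = strassen_sketch(A21 + A22, B11, cutoff)
     M3 = strassen_sketch(A11, B12 - B22, cutoff)
     M4 = strassen_sketch(A22, B21 - B11, cutoff)
     M5 = strassen_sketch(A11 + A12, B22, cutoff)
     M6 = strassen_sketch(A21 - A11, B11 + B12, cutoff)
     M7 = strassen_sketch(A12 - A22, B21 + B22, cutoff)
     return block_matrix([[M1 + M4 - M5 + M7, M3 + M5],
                          [M2 + M4, M1 - M2 + M3 + M6]], subdivide=False)

 # In a Sage session, e.g.:
 #   A = random_matrix(GF(9, 'x'), 256); B = random_matrix(GF(9, 'x'), 256)
 #   strassen_sketch(A, B, 32) == A * B   # True
 }}}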

 Meataxe packs matrix coefficients densely. I.e., if you have a field of
 size 3, then it packs 5 matrix coefficients into one byte (`3^5<256`).
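
 For illustration, the number of coefficients packed into one byte is the
 largest `k` with `q^k <= 256`. A hypothetical helper (my own, not taken
 from the meataxe sources) that computes it:
 {{{
 #!python
 def coefficients_per_byte(q):
     # largest k such that k elements of GF(q) still fit into one byte
     k = 1
     while q ** (k + 1) <= 256:
         k += 1
     return k

 # coefficients_per_byte(2) == 8,  coefficients_per_byte(3) == 5,
 # coefficients_per_byte(9) == 2,  coefficients_per_byte(25) == 1
 }}}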

 Each meataxe matrix is stored in one block of memory, row by row.
 Internally, meataxe uses "long" operations in some functions. That's why
 each row occupies a number of bytes that is a multiple of sizeof(long),
 with up to sizeof(long)-1 trailing zero bytes ("padding bytes").
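
 As a hedged sketch of that layout (my own helper, not meataxe code), the
 number of bytes occupied by one row would be computed like this:
 {{{
 #!python
 import ctypes
 SIZEOF_LONG = ctypes.sizeof(ctypes.c_long)   # 8 on my machine

 def padded_row_bytes(ncols, per_byte, sizeof_long=SIZEOF_LONG):
     # bytes needed for the packed coefficients of one row ...
     used = -(-ncols // per_byte)             # = ceil(ncols / per_byte)
     # ... rounded up to a multiple of sizeof(long); the difference is
     # filled with zero "padding bytes"
     return -(-used // sizeof_long) * sizeof_long

 # e.g. a row of 100 coefficients over GF(3): ceil(100/5) = 20 packed bytes,
 # padded to padded_row_bytes(100, 5) == 24 bytes on a 64-bit machine.
 }}}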

 Winograd-Strassen operates on matrix windows. It is rather obvious that
 the window shouldn't end in the middle of a byte, i.e., the number of
 columns of a window should be a multiple of the "number of coefficients
 per byte".

 In my old patches for version 2.2.4, I in fact used "byte" as the unit.
 But in the new patches for version 2.4.24, I adopt meataxe's convention
 that the row length should be divisible by sizeof(long). So, in my
 Winograd-Strassen implementation, the number of columns of a matrix window
 is a multiple of "sizeof(long)*(number of coefficients per byte)".

 Consequently, the cutoff parameter in meataxe is given in the unit
 "sizeof(long)" (in my current cython wrapper the unit is byte, but that's
 likely to change).
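
 In other words (my own arithmetic, just to fix the units): every block
 produced by the recursion has at least
 cutoff * sizeof(long) * (coefficients per byte) columns.
 {{{
 #!python
 def cutoff_in_columns(cutoff, per_byte, sizeof_long=8):
     # translate a cutoff given in units of sizeof(long) into the minimal
     # number of columns of the blocks produced by the recursion
     return cutoff * sizeof_long * per_byte

 # e.g. over GF(3) on a 64-bit machine, cutoff 4 means blocks of at least
 # cutoff_in_columns(4, 5) == 160 columns.
 }}}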

 With the following function, used in a Sage session, I have determined the
 optimal cutoff parameter for different field sizes and different matrix
 dimensions. I have also tested for which matrix dimensions the school book
 algorithm is faster than the Winograd-Strassen algorithm, and I checked
 that both algorithms yield the same result.
 {{{
 #!python
 import sys
 from sage.matrix.matrix_gfpn_dense import Matrix_gfpn_dense

 def find_cutoff(q, D):
     # For each matrix dimension n, time meataxe's school book product
     # (recorded with cutoff index 0) and the Winograd-Strassen product for
     # the cutoffs 8, 16, ..., 64 (the current wrapper takes the cutoff in
     # bytes), collecting (time, cutoff index) pairs in the dictionary D.
     print "GF({})".format(q)
     for n in [100, 500, 1000, 2000]:
         print "  n = {}:".format(n),
         MS = MatrixSpace(GF(q, 'x'), n)
         MS._MatrixSpace__matrix_class = Matrix_gfpn_dense  # force the meataxe backend
         M = MS.random_element()
         t = cputime()
         N = M._multiply_classical(M)
         D[q, n] = set()
         D[q, n].add((cputime(t), 0))
         for c in range(1, 9):
             print c*8,
             sys.stdout.flush()
             t = cputime()
             N1 = M._multiply_strassen(M, c*8)
             D[q, n].add((cputime(t), c))
             assert N1 == N          # both algorithms must agree
         print
         print "->", sorted(D[q, n])[:5]   # the five fastest runs
 }}}
 The result was
 {{{
 sage: D = {}
 sage: for q in [2,3,8,9,25,27,32,49,81,125,243]: find_cutoff(q, D)
 GF(2)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0, 0), (0.0, 1), (0.0, 2), (0.0, 3), (0.0, 4)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.002999999999985903, 5), (0.002999999999985903, 6),
 (0.002999999999985903, 7), (0.0030000000000427463, 4),
 (0.0030000000000427463, 8)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(0.019000000000005457, 0), (0.01999999999998181, 8),
 (0.02199999999999136, 4), (0.02199999999999136, 5), (0.02199999999999136,
 7)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(0.12000000000000455, 0), (0.14599999999995816, 8),
 (0.1560000000000059, 4), (0.15699999999998226, 5), (0.15699999999998226,
 6)]
 GF(3)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0, 0), (0.0, 4), (0.0, 7), (0.0009999999999763531, 2),
 (0.0009999999999763531, 3)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.04899999999997817, 0), (0.05000000000001137, 7),
 (0.05099999999998772, 8), (0.051999999999964075, 5), (0.05299999999999727,
 4)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(0.3509999999999991, 4), (0.3509999999999991, 5), (0.3509999999999991,
 6), (0.36199999999996635, 8), (0.36299999999999955, 7)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(2.4659999999999513, 6), (2.4669999999999845, 4), (2.4669999999999845,
 5), (2.5469999999999686, 7), (2.548000000000002, 8)]
 GF(8)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0009999999999763531, 3), (0.0009999999999763531, 6),
 (0.0009999999999763531, 8), (0.0010000000000331966, 1),
 (0.0019999999999527063, 0)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.1169999999999618, 3), (0.117999999999995, 2), (0.1209999999999809,
 5), (0.12100000000003774, 4), (0.12100000000003774, 6)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(0.9129999999999541, 7), (0.9130000000000109, 4), (0.9130000000000109,
 6), (0.9139999999999873, 5), (0.9519999999999982, 3)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(6.291999999999973, 7), (6.29200000000003, 4), (6.293000000000006, 5),
 (6.293000000000006, 6), (6.569000000000017, 2)]
 GF(9)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0009999999999763531, 7), (0.0010000000000331966, 2),
 (0.0010000000000331966, 4), (0.0019999999999527063, 0),
 (0.0019999999999527063, 3)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.12299999999999045, 3), (0.1239999999999668, 7),
 (0.12400000000002365, 2), (0.125, 4), (0.125, 5)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(0.9429999999999836, 4), (0.9429999999999836, 6), (0.9430000000000405,
 7), (0.9440000000000168, 5), (0.9830000000000041, 8)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(6.517999999999972, 6), (6.522999999999968, 7), (6.524000000000001,
 5), (6.5330000000000155, 4), (6.793999999999983, 8)]
 GF(25)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.002999999999985903, 0), (0.002999999999985903, 2),
 (0.002999999999985903, 3), (0.002999999999985903, 4),
 (0.002999999999985903, 6)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.28300000000001546, 4), (0.2839999999999918, 6),
 (0.2839999999999918, 7), (0.2840000000001055, 5), (0.2949999999999591, 8)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(1.9640000000000555, 7), (1.9650000000000318, 6), (1.9669999999999845,
 4), (1.969000000000051, 5), (2.044999999999959, 8)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(13.799999999999955, 4), (13.805000000000064, 6), (13.807000000000016,
 7), (13.81499999999994, 5), (14.360000000000014, 8)]
 GF(27)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0029999999999290594, 0), (0.0029999999999290594, 4),
 (0.0029999999999290594, 6), (0.0030000000000427463, 2),
 (0.0030000000000427463, 3)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.28099999999994907, 5), (0.28099999999994907, 7),
 (0.28100000000006276, 4), (0.28100000000006276, 6), (0.29300000000000637,
 8)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(1.9509999999999081, 4), (1.9510000000000218, 6), (1.9510000000000218,
 7), (1.9520000000001119, 5), (2.033999999999878, 8)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(13.709999999999923, 6), (13.718000000000075, 7), (13.731999999999971,
 5), (13.764999999999873, 4), (14.337999999999965, 8)]
 GF(32)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0029999999999290594, 2), (0.0029999999999290594, 5),
 (0.0029999999999290594, 7), (0.0030000000000427463, 0),
 (0.0030000000000427463, 3)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.2700000000000955, 7), (0.27099999999995816, 4),
 (0.27099999999995816, 6), (0.2720000000000482, 5), (0.2799999999999727,
 2)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(1.8730000000000473, 4), (1.893000000000029, 5), (1.8960000000000719,
 7), (1.8980000000000246, 6), (1.9220000000000255, 2)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(13.370999999999981, 4), (13.513000000000034, 5), (13.535999999999945,
 2), (13.625, 3), (13.687000000000012, 7)]
 GF(49)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0029999999999290594, 4), (0.0029999999999290594, 7),
 (0.0030000000000427463, 0), (0.0030000000000427463, 2),
 (0.0030000000000427463, 3)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.28300000000001546, 4), (0.2860000000000582, 6),
 (0.2869999999999209, 5), (0.28700000000003456, 7), (0.2939999999999827,
 8)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(1.9269999999999072, 6), (1.9300000000000637, 7), (1.9329999999999927,
 5), (1.9470000000000027, 4), (2.038000000000011, 8)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(13.607000000000198, 6), (13.632000000000062, 7), (13.685999999999922,
 5), (13.788999999999987, 4), (14.29099999999994, 2)]
 GF(81)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0029999999999290594, 2), (0.0029999999999290594, 4),
 (0.0029999999999290594, 5), (0.0029999999999290594, 7),
 (0.0029999999999290594, 8)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.2819999999999254, 4), (0.2819999999999254, 7), (0.2840000000001055,
 6), (0.28899999999998727, 5), (0.29500000000007276, 2)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(1.9349999999999454, 6), (1.9409999999998035, 7), (1.9610000000000127,
 4), (1.9629999999999654, 5), (2.0399999999999636, 8)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(13.505999999999858, 7), (13.623000000000047, 5), (13.644000000000233,
 6), (13.719000000000278, 4), (14.055000000000064, 8)]
 GF(125)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0029999999999290594, 0), (0.0029999999999290594, 2),
 (0.0029999999999290594, 4), (0.0029999999999290594, 5),
 (0.0029999999999290594, 7)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.29599999999982174, 5), (0.2960000000000491, 4),
 (0.2960000000000491, 6), (0.2960000000000491, 7), (0.3009999999999309, 8)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(2.0429999999998927, 5), (2.04300000000012, 7), (2.0449999999998454,
 4), (2.0450000000000728, 6), (2.0960000000000036, 8)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(14.353999999999814, 7), (14.366000000000213, 5), (14.383000000000266,
 4), (14.414999999999964, 6), (14.596000000000004, 8)]
 GF(243)
   n = 100: 8 16 24 32 40 48 56 64
 -> [(0.0029999999999290594, 4), (0.0029999999999290594, 6),
 (0.003000000000156433, 0), (0.003000000000156433, 2),
 (0.003000000000156433, 8)]
   n = 500: 8 16 24 32 40 48 56 64
 -> [(0.3369999999999891, 5), (0.3369999999999891, 7),
 (0.33799999999996544, 4), (0.3380000000001928, 6), (0.34799999999995634,
 8)]
   n = 1000: 8 16 24 32 40 48 56 64
 -> [(2.336999999999989, 4), (2.336999999999989, 6), (2.336999999999989,
 7), (2.341999999999871, 5), (2.4020000000000437, 8)]
   n = 2000: 8 16 24 32 40 48 56 64
 -> [(17.700999999999794, 4), (17.958000000000084, 5), (17.998000000000047,
 7), (18.02800000000002, 2), (18.03599999999983, 6)]
 }}}

 '''__Evaluation of these timings__'''

 - My Winograd-Strassen implementation gives the same result as meataxe's
 school book implementation. '''--> Hooray'''
 - There is not a single case (not even for small matrices) in which
 meataxe's school book implementation is significantly faster than
 Winograd-Strassen. '''--> Winograd-Strassen should be the default algorithm
 in matrix_gfpn_dense'''
 - Cutoffs between 4 and 7 (in the unit "sizeof(long)") perform about the
 same and are generally better than other cutoffs.

 The last statement is backed up by another study, in which I considered
 3601x3601 matrices over different fields, varying the cutoff parameter. I
 got these timings for multiplication:
 {{{
 cutoff       1        2        3        4        5        6        7        8
 GF(4):    15.5 s   12.2 s   12.3 s   11 s     11 s     11 s     11 s     11.5 s
 GF(9):    41.3 s   33.7 s   33.5 s   31.6 s   31.5 s   31.5 s   31.5 s   34 s
 GF(16):   26.4 s   21.1 s   21 s     19.9 s   19.9 s   19.9 s   19.9 s   21.4 s
 GF(25):   1min 34s 1min 12s 1min 12s 1min 8s  1min 8s  1min 8s  1min 8s  1min 12s
 GF(32):   1min 27s 1min 6s  1min 6s  1min 4s  1min 4s  1min 4s  1min 4s  1min 10s
 GF(243):  1min 55s 1min 27s 1min 27s 1min 18s 1min 18s 1min 18s 1min 19s 1min 24s
 }}}

 Now, the real question is: what is the meaning of "4"? My guess is that
 "4" stands for "sizeof(long)/2", so that the default cutoff parameter
 would be "sizeof(long)/2". But I could be mistaken. Perhaps on a 32-bit
 machine, cutoff 4 is better than cutoff 2 = sizeof(long)/2, and on a
 128-bit machine, cutoff 4 is better than cutoff 8 = sizeof(long)/2? Or
 perhaps it all depends on the L1/L2 cache size?
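
 For reference, here is a quick way to report the relevant parameter of a
 test machine from the same Sage session (just a convenience snippet, not
 part of the branch; I leave out the cache sizes, since there is no
 portable one-liner for them):
 {{{
 #!python
 import ctypes, platform
 # machine type and sizeof(long) in bytes
 print platform.machine(), ctypes.sizeof(ctypes.c_long)
 }}}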

 '''__Request to the reviewer__'''

 - Do you have access to machines with different architectures (32-bit,
 128-bit, different cache sizes)? Could you try to use the function above
 to find the optimal cutoff parameters?
 - Meataxe uses a variable ASM_MMX in the file kernel-0.c: if it is set,
 some functions use assembler code in characteristic two. I copied that
 approach in `FfAddMapRowWindow`, mentioned above. A comment in the sources
 says that "this assumes Intel with 4 bytes per long, but MMX implies Intel
 anyway."

   My questions:

     - Is ASM_MMX some kind of standard variable?
     - Can you test the package on a machine where ASM_MMX is used?

   Note that I do have Intel, but my longs are 8 bytes, not 4.

--
Ticket URL: <http://trac.sagemath.org/ticket/12103#comment:95>