#12103: Use MeatAxe as an optional back end for dense matrices over `GF(p^n)`, p
odd, n>1, `p^n<255`
-------------------------------------+-------------------------------------
Reporter: SimonKing | Owner: jason, was
Type: defect | Status: needs_work
Priority: major | Milestone: sage-6.4
Component: packages: | Resolution:
experimental | Merged in:
Keywords: linear algebra, | Reviewers:
MeatAxe | Work issues:
Authors: Simon King | Commit:
Report Upstream: None of the above | c1d5026ade5e1bb454401a34a734c91c15cbf75f
- read trac for reasoning. | Stopgaps:
Branch: |
u/SimonKing/meataxe |
Dependencies: |
-------------------------------------+-------------------------------------
Comment (by SimonKing):
I force-pushed the new version.
Changes: The bug is fixed, and FILE and PATCHES are removed.
Now for timings and the quest for a good (default) cutoff.
When using Winograd-Strassen, one needs to define a matrix size such that
divide-and-conquer is applied if and only if all resulting blocks have at
least that size.
Meataxe packs matrix coefficients densely. I.e, if you have a field of
size 3, then it will pack 5 matrix coefficients into one byte (`2^5<256`).
Each meataxe matrix is stored in a block of memory, row by row.
Internally, meataxe uses "long" operations in some functions. That's why
each row uses a byte size that is a multiple of sizeof(long), using up to
sizeof(long)-1 trailing zero bytes ("padding bytes").
Winograd-Strassen operates on matrix windows. It is rather obvious that
the window shouldn't end in the middle of a byte, i.e., the number of
columns of a window should be a multiple of the "number of coefficients
per byte".
In my old patches for version 2.2.4, in fact I used "byte" as unit. But in
the new patches for version 2.4.24, I adopt meataxe's convention that row
length should be divisible by sizeof(long). So, in my Winograd-Strassen
implementation, the number of columns of matrix windows is a multiple of
"sizeof(long)*(number of coefficients)".
Consequently, the cutoff parameter in meataxe is given in the unit
"sizeof(long)" (in my current cython wrapper the unit is byte, but that's
likely to change).
With the following function used in a Sage session, I have determined the
optimal cutoff parameter for different field sizes and different matrix
dimensions. Also I have tested for what matrix dimensions the school book
algorithm is faster than the Winograd-Strassen algorithm, and I checked
that both algorithms yield the same result.
{{{
#!python
from sage.matrix.matrix_gfpn_dense import Matrix_gfpn_dense
def find_cutoff(q, D):
print "GF({})".format(q)
for n in [100,500,1000,2000]:
print " n = {}:".format(n),
MS = MatrixSpace(GF(q,'x'), n)
MS._MatrixSpace__matrix_class = Matrix_gfpn_dense
M = MS.random_element()
t = cputime()
N = M._multiply_classical(M)
D[q,n] = set()
D[q,n].add((cputime(t),0))
for c in range(1,9):
print c*8,
sys.stdout.flush()
t = cputime()
N1 = M._multiply_strassen(M, c*8)
D[q,n].add((cputime(t), c))
assert N1==N
print
print "->",sorted(list((D[q,n])))[:5]
}}}
The result was
{{{
sage: D = {}
sage: for q in [2,3,8,9,25,27,32,49,81,125,243]: find_cutoff(q, D)
GF(2)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0, 0), (0.0, 1), (0.0, 2), (0.0, 3), (0.0, 4)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.002999999999985903, 5), (0.002999999999985903, 6),
(0.002999999999985903, 7), (0.0030000000000427463, 4),
(0.0030000000000427463, 8)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(0.019000000000005457, 0), (0.01999999999998181, 8),
(0.02199999999999136, 4), (0.02199999999999136, 5), (0.02199999999999136,
7)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(0.12000000000000455, 0), (0.14599999999995816, 8),
(0.1560000000000059, 4), (0.15699999999998226, 5), (0.15699999999998226,
6)]
GF(3)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0, 0), (0.0, 4), (0.0, 7), (0.0009999999999763531, 2),
(0.0009999999999763531, 3)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.04899999999997817, 0), (0.05000000000001137, 7),
(0.05099999999998772, 8), (0.051999999999964075, 5), (0.05299999999999727,
4)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(0.3509999999999991, 4), (0.3509999999999991, 5), (0.3509999999999991,
6), (0.36199999999996635, 8), (0.36299999999999955, 7)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(2.4659999999999513, 6), (2.4669999999999845, 4), (2.4669999999999845,
5), (2.5469999999999686, 7), (2.548000000000002, 8)]
GF(8)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0009999999999763531, 3), (0.0009999999999763531, 6),
(0.0009999999999763531, 8), (0.0010000000000331966, 1),
(0.0019999999999527063, 0)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.1169999999999618, 3), (0.117999999999995, 2), (0.1209999999999809,
5), (0.12100000000003774, 4), (0.12100000000003774, 6)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(0.9129999999999541, 7), (0.9130000000000109, 4), (0.9130000000000109,
6), (0.9139999999999873, 5), (0.9519999999999982, 3)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(6.291999999999973, 7), (6.29200000000003, 4), (6.293000000000006, 5),
(6.293000000000006, 6), (6.569000000000017, 2)]
GF(9)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0009999999999763531, 7), (0.0010000000000331966, 2),
(0.0010000000000331966, 4), (0.0019999999999527063, 0),
(0.0019999999999527063, 3)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.12299999999999045, 3), (0.1239999999999668, 7),
(0.12400000000002365, 2), (0.125, 4), (0.125, 5)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(0.9429999999999836, 4), (0.9429999999999836, 6), (0.9430000000000405,
7), (0.9440000000000168, 5), (0.9830000000000041, 8)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(6.517999999999972, 6), (6.522999999999968, 7), (6.524000000000001,
5), (6.5330000000000155, 4), (6.793999999999983, 8)]
GF(25)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.002999999999985903, 0), (0.002999999999985903, 2),
(0.002999999999985903, 3), (0.002999999999985903, 4),
(0.002999999999985903, 6)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.28300000000001546, 4), (0.2839999999999918, 6),
(0.2839999999999918, 7), (0.2840000000001055, 5), (0.2949999999999591, 8)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(1.9640000000000555, 7), (1.9650000000000318, 6), (1.9669999999999845,
4), (1.969000000000051, 5), (2.044999999999959, 8)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(13.799999999999955, 4), (13.805000000000064, 6), (13.807000000000016,
7), (13.81499999999994, 5), (14.360000000000014, 8)]
GF(27)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0029999999999290594, 0), (0.0029999999999290594, 4),
(0.0029999999999290594, 6), (0.0030000000000427463, 2),
(0.0030000000000427463, 3)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.28099999999994907, 5), (0.28099999999994907, 7),
(0.28100000000006276, 4), (0.28100000000006276, 6), (0.29300000000000637,
8)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(1.9509999999999081, 4), (1.9510000000000218, 6), (1.9510000000000218,
7), (1.9520000000001119, 5), (2.033999999999878, 8)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(13.709999999999923, 6), (13.718000000000075, 7), (13.731999999999971,
5), (13.764999999999873, 4), (14.337999999999965, 8)]
GF(32)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0029999999999290594, 2), (0.0029999999999290594, 5),
(0.0029999999999290594, 7), (0.0030000000000427463, 0),
(0.0030000000000427463, 3)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.2700000000000955, 7), (0.27099999999995816, 4),
(0.27099999999995816, 6), (0.2720000000000482, 5), (0.2799999999999727,
2)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(1.8730000000000473, 4), (1.893000000000029, 5), (1.8960000000000719,
7), (1.8980000000000246, 6), (1.9220000000000255, 2)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(13.370999999999981, 4), (13.513000000000034, 5), (13.535999999999945,
2), (13.625, 3), (13.687000000000012, 7)]
GF(49)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0029999999999290594, 4), (0.0029999999999290594, 7),
(0.0030000000000427463, 0), (0.0030000000000427463, 2),
(0.0030000000000427463, 3)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.28300000000001546, 4), (0.2860000000000582, 6),
(0.2869999999999209, 5), (0.28700000000003456, 7), (0.2939999999999827,
8)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(1.9269999999999072, 6), (1.9300000000000637, 7), (1.9329999999999927,
5), (1.9470000000000027, 4), (2.038000000000011, 8)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(13.607000000000198, 6), (13.632000000000062, 7), (13.685999999999922,
5), (13.788999999999987, 4), (14.29099999999994, 2)]
GF(81)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0029999999999290594, 2), (0.0029999999999290594, 4),
(0.0029999999999290594, 5), (0.0029999999999290594, 7),
(0.0029999999999290594, 8)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.2819999999999254, 4), (0.2819999999999254, 7), (0.2840000000001055,
6), (0.28899999999998727, 5), (0.29500000000007276, 2)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(1.9349999999999454, 6), (1.9409999999998035, 7), (1.9610000000000127,
4), (1.9629999999999654, 5), (2.0399999999999636, 8)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(13.505999999999858, 7), (13.623000000000047, 5), (13.644000000000233,
6), (13.719000000000278, 4), (14.055000000000064, 8)]
GF(125)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0029999999999290594, 0), (0.0029999999999290594, 2),
(0.0029999999999290594, 4), (0.0029999999999290594, 5),
(0.0029999999999290594, 7)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.29599999999982174, 5), (0.2960000000000491, 4),
(0.2960000000000491, 6), (0.2960000000000491, 7), (0.3009999999999309, 8)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(2.0429999999998927, 5), (2.04300000000012, 7), (2.0449999999998454,
4), (2.0450000000000728, 6), (2.0960000000000036, 8)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(14.353999999999814, 7), (14.366000000000213, 5), (14.383000000000266,
4), (14.414999999999964, 6), (14.596000000000004, 8)]
GF(243)
n = 100: 8 16 24 32 40 48 56 64
-> [(0.0029999999999290594, 4), (0.0029999999999290594, 6),
(0.003000000000156433, 0), (0.003000000000156433, 2),
(0.003000000000156433, 8)]
n = 500: 8 16 24 32 40 48 56 64
-> [(0.3369999999999891, 5), (0.3369999999999891, 7),
(0.33799999999996544, 4), (0.3380000000001928, 6), (0.34799999999995634,
8)]
n = 1000: 8 16 24 32 40 48 56 64
-> [(2.336999999999989, 4), (2.336999999999989, 6), (2.336999999999989,
7), (2.341999999999871, 5), (2.4020000000000437, 8)]
n = 2000: 8 16 24 32 40 48 56 64
-> [(17.700999999999794, 4), (17.958000000000084, 5), (17.998000000000047,
7), (18.02800000000002, 2), (18.03599999999983, 6)]
}}}
'''__Evaluation of these timings__'''
- My Winograd-Strassen implementation gives the same result as meataxe's
school book implementation. '''--> Hooray'''
- There is not a single case (not even for small matrices) where meataxe's
school book implementation is significantly faster than Winograd-Strassen
'''--> Winograd-Strassen should be the default algorithm in
matrix_gfpn_dense'''
- It seems that a cutoff between 4 and 7 in the unit "sizeof(long)" are
about the same, and are generally better than other cutoffs.
The last statement is backed up by another study, where I consider
3601x3601-matrices over different fields, varying the cutoff parameter. I
got these timings for multiplication:
{{{
cutoff 1 2 3 4 5 6 7
8
GF(4): 15.5 s 12.2 s 12.3 s 11 s 11 s 11 s 11 s
11.5 s
GF(9): 41.3 s 33.7 s 33.5 s 31.6 s 31.5 s 31.5 s 31.5 s
34 s
GF(16): 26.4 s 21.1 s 21 s 19.9 s 19.9 s 19.9 s 19.9 s
21.4 s
GF(25): 1min 34s 1min 12s 1min 12s 1min 8s 1min 8s 1min 8s 1min 8s
1min 12s
GF(32): 1min 27s 1min 6s 1min 6s 1min 4s 1min 4s 1min 4s 1min 4s
1min 10s
GF(243): 1min 55s 1min 27s 1min 27s 1min 18s 1min 18s 1min 18s 1min 19s
1min 24s
}}}
Now, the real question is: What is the meaning of "4"? My guess is that
"4" is "sizeof(long)/2", so that the default cutoff parameter now is
"sizeof(long)/2". But I could be mistaken. Perhaps on a 32-bit machine,
cutoff 4 is better than cutoff "2==sizeof(long)/2", and on a 128-bit
machine, cutoff 4 is better than cutoff "8=sizeof(long)/2"? Or perhaps it
all depends on L1/L2 cache size?
'''__Request to the reviewer__'''
- Do you have access to machines of different architecture (32-bit,
128-bit, different cache sizes)? Could you try to use the function above
to find the optimal cutoff parameters?
- Meataxe uses an environment variable ASM_MMX in the file kernel-0.c. It
means that in some functions it uses assembler code in characteristic two.
By copying I use it as well, in `FfAddMapRowWindow` mentioned above. In
some comment, it says that "this assumes Intel with 4 bytes per long, but
MMX implies Intel anyway."
My questions:
- Is ASM_MMX some kind of standard variable?
- Can you test the package on a machine where ASM_MMX is used?
Note that I do have Intel, but my longs are 8 bytes, not 4.
--
Ticket URL: <http://trac.sagemath.org/ticket/12103#comment:95>
Sage <http://www.sagemath.org>
Sage: Creating a Viable Open Source Alternative to Magma, Maple, Mathematica,
and MATLAB
--
You received this message because you are subscribed to the Google Groups
"sage-trac" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/sage-trac.
For more options, visit https://groups.google.com/d/optout.