I have been investigating the accuracy of the LinearCombination class,
prompted by using it for high precision computations in the Complex class
where it did not achieve the desired precision.
The basis of the class is to compute dot products: sum_i a_i*b_i. It does
this by splitting each number into a two-part higher precision
representation, multiplying the parts, and using extra additions to track the
round-off of the standard summation. The accumulated round-off is added back
at the end.
The case where it fails is a linear combination whose total sum is close to
0 but whose intermediate sum is far enough from zero that there is a large
difference between the exponents of the summed terms and the final result.
This case is important for Complex numbers, which require computing
log1p(x^2+y^2-1) when x^2+y^2 is close to 1: log(1) would be ~zero, yet the
true logarithm is representable to very high precision.
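For illustration, the kind of call involved looks like this (a sketch; the
exact usage in Complex differs and the 6-argument overload shown is an
assumption):

// Evaluate x^2 + y^2 - 1 as a single dot product so the round-off of
// each product is retained; a naive evaluation loses all significance
// when (x, y) is near the unit circle.
final double s = LinearCombination.value(x, x, y, y, 1, -1);
final double result = Math.log1p(s);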
LinearCombination has a few things that could be changed. However I note that
the class has been written to avoid conditional statements on magnitude, and
so far I have not done a speed analysis of the implications of changes. I can
only estimate based on the assumption that 1 conditional (|x| < |y|) costs
about 3 floating point operations (flops). This is a figure used by Ogita et
al (2005) in the paper upon which LinearCombination is based [1].
For reference I have also referred to Dekker (1971) [2] and Shewchuk (1997)
[3]. Dekker is widely cited as the originator of the split product and split
addition, where a double with a 53-bit significand is split into two doubles
of 26 bits each (1 bit is moved to the sign component) and the terms
multiplied separately with exact precision. Shewchuk's paper builds on the
2-part split to produce algorithms that express a double as an expansion of
n parts. In either case the sum of the expansion is the original number
without loss of accuracy. Dekker's split addition requires knowing which part
is larger in magnitude. An alternative is an algorithm from Knuth that uses
more flops but does not require an ordering comparison.
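For reference, here are sketches of the two standard error-free additions,
where sum is assumed to be the rounded result of a + b:

// Dekker's version: exact only when |a| >= |b|, so the caller must order
// the arguments (or branch). 3 flops.
static double fastSumLow(double a, double b, double sum) {
    return b - (sum - a);
}

// Knuth's version: no ordering requirement. 6 flops, branch free.
static double sumLow(double a, double b, double sum) {
    final double bVirtual = sum - a;
    return (a - (sum - bVirtual)) + (b - bVirtual);
}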
The algorithm in LinearCombination is an implementation of dot2 from Ogita et
al (algorithm 5.3). There is a subtle difference in that the original dot2
algorithm sums the round-off from the products and the round-off from the
product summation together. The method in LinearCombination sums them
separately (using an extra variable) and adds them at the end. This actually
increases precision in my testing as the round-offs are on a different scale.
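To make the difference concrete, here is a sketch of the dot2 accumulation
(sumLow is Knuth's two-sum above; productLow is Dekker's two-product
round-off, sketched later in this mail):

double p = a[0] * b[0];
double s = productLow(a[0], b[0], p);
for (int i = 1; i < a.length; i++) {
    final double h = a[i] * b[i];
    final double x = p + h;
    // dot2: a single accumulator receives both round-off streams
    s += sumLow(p, h, x) + productLow(a[i], b[i], h);
    p = x;
}
return p + s;

The LinearCombination variant instead uses two accumulators,
s1 += productLow(...) and s2 += sumLow(...), and returns p + (s1 + s2).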
Here are the things I have tried changing:
1. Using Dekker's algorithm to split the number. As stated earlier this
extracts two 26-bit mantissas from a 53-bit mantissa (in IEEE 754 the leading
bit in front of the 52-bit stored mantissa is assumed to be 1). This is done
by multiplication by 2^s + 1 with s = ceil(53/2) = 27:
big = (2^s + 1) * a
a_hi = big - (big - a)
The extra bit of precision is carried into the sign of the low part of the
split number.
This is in contrast to the method in LinearCombination that uses a simple
mask on the long representation to obtain the a_hi part with 26 bits; the
lower part then holds up to 27 bits.
The advantage of Dekker's method is that it produces 2 parts with 26 bits in
the mantissa that can be multiplied exactly. The disadvantage is the
potential for overflow.
It also appropriately handles very small sub-normal numbers, which under the
current method would be masked to a 0 high part with all the bits in the low
part. That has knock-on effects on the split multiplication, which wants the
high part to be larger. A sketch of both splits follows.
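// Dekker's split: multiply by 2^27 + 1. Both parts fit in 26 bits of
// significand; the extra bit is carried in the sign of the low part.
// May overflow for large values.
static double highPartDekker(double value) {
    final double c = (0x1.0p27 + 1) * value;
    return c - (c - value);
}

// Mask-based split (the current approach): zero the low 27 bits of the
// stored mantissa. The high part has 26 bits; the low part up to 27.
static double highPartMask(double value) {
    return Double.longBitsToDouble(Double.doubleToRawLongBits(value) & ((-1L) << 27));
}

In both cases the low part is obtained as value - highPart(value).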
2. Using Dekker's dot product algorithm. This maintains a running split sum
of the individual products. It requires an ordering comparison per summation
to ensure the error term is appropriately maintained (see the sketch below).
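A sketch of the accumulation, following Dekker's add2 for the running split
sum (my reconstruction; the code I tested may differ in detail):

double hi = 0;
double lo = 0;
for (int i = 0; i < a.length; i++) {
    final double ph = a[i] * b[i];
    final double pl = productLow(a[i], b[i], ph);
    // Ordered two-sum of the high parts: this is the per-summation branch.
    final double s = hi + ph;
    final double e = Math.abs(hi) >= Math.abs(ph) ? ph - (s - hi) : hi - (s - ph);
    // Fold in the low-order terms and renormalise the split sum.
    final double w = e + (lo + pl);
    hi = s + w;
    lo = w - (hi - s);
}
return hi + lo;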
3. A variant with higher order summation using 3 doubles to represent the
expanded split sum, i.e. a Klein summation [4] of the running sum (sketched
below). Although this implementation was better, it was not significantly
better and I will not show results.
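For reference, a plain-array version of Klein's second-order Kahan-Babuska
summation [4]; my variant applied this style of carry tracking to the running
sum of the split products:

static double kleinSum(double[] x) {
    double s = 0;
    double cs = 0;
    double ccs = 0;
    for (final double xi : x) {
        // First-order carry of the main sum.
        double t = s + xi;
        final double c = Math.abs(s) >= Math.abs(xi) ? (s - t) + xi : (xi - t) + s;
        s = t;
        // Second-order carry of the carry sum.
        t = cs + c;
        final double cc = Math.abs(cs) >= Math.abs(c) ? (cs - t) + c : (c - t) + cs;
        cs = t;
        ccs += cc;
    }
    return s + cs + ccs;
}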
4. Using the dotK algorithm (algorithm 5.10) of Ogita et al (full code later
in this mail). This method stores all the round-off parts from the products
and sums in an array. If simply summed at the end it will match dot2 (K=2).
When K>2, repeated passes over the array apply a split sum to create an
error-free transformation of the data, which is then summed. The
transformation distils the data so that it is less affected by differences in
the scale of each part to be summed.
The paper by Ogita et al states that sum3 has approximately 2.2x the run-time
of sum2. The error is that of a K-fold working precision summation; thus the
sum is to 3-fold precision versus 2-fold precision.
Results
I created increasingly large arrays composed of the quartet of numbers [a, b,
c, d] with the condition that:
a^2 + b^2 = 1
c^2 + d^2 = 1
These are multiplied as the dot product [a, b, c, d] * [a, b, -c, -d], so
when summed the result should be 0; in practice there is a lot of
floating-point error. I compute the correct result using BigDecimal. The
scale of the BigDecimal is clipped to -324 so that it does not go under
Double.MIN_VALUE.
This never happens in the test conditions AFAIK but it creates a realistic
limit for the computation. I measure the error using units of least precision
(ulps).
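One way to generate the quartets (an assumption; my harness may differ) is to
take points on the unit circle from random angles:

final java.util.SplittableRandom rng = new java.util.SplittableRandom();
final double t1 = rng.nextDouble() * 2 * Math.PI;
final double t2 = rng.nextDouble() * 2 * Math.PI;
final double[] u = {Math.cos(t1), Math.sin(t1), Math.cos(t2), Math.sin(t2)};
final double[] v = {u[0], u[1], -u[2], -u[3]};
// dot(u, v) = a^2 + b^2 - (c^2 + d^2): ~0 but not exactly 0 in the doubles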
In the following results each block is headed by the total length of the data
(n). Each numbered line within a block is one method (numbered as in the
method list below), giving:
Mean ulps
Standard deviation ulps
Median ulps
Percentile 99 ulps
Percentile 99.9 ulps
Percentile 100 ulps (i.e. max error)
Error count (i.e. when ulps != 0)
Sign error (when the result has the wrong sign)
For all metrics lower is better. 100000 samples were run.
The methods are:
1. LinearCombination.value
2. LinearCombination.value modified to use Dekker’s split
3. Dekker’s dot product
4. Dot2s as described in Ogita et al
5. DotK with K=2
6. DotK with K=3
n=4
1: 2.76,50.3,0,32,272,8220,40510,0
2: 1.17,28.7,0,15,128,5888,21154,0
3: 1.05,37.4,0,9,118,8192,19680,0
4: 1.39,39.0,0,16,128,6144,23002,0
5: 1.21,28.9,0,16,128,5888,21961,0
6: 0.00,0.00,0,0,0,0,0,0

n=8
1: 9.01,1.66e+03,1,37,352,524288,53469,0
2: 1.96,57.2,0,18,168,12288,33874,0
3: 1.46,55.9,0,15,128,16384,31287,0
4: 2.32,58.1,0,23,240,12288,40110,0
5: 2.42,118,0,20,192,32768,35870,0
6: 0.00,0.00,0,0,0,0,0,0

n=12
1: 5.04,280,1,40,426,73728,58035,0
2: 2.53,88.2,0,23,256,23032,40911,0
3: 1.96,96.3,0,16,160,23032,36646,0
4: 3.36,103,0,31,272,15872,49167,0
5: 2.70,89.2,0,24,266,23032,43062,0
6: 0.00,0.00,0,0,0,0,0,0

n=16
1: 4.93,207,1,40,428,45784,59624,0
2: 3.44,198,0,26,248,45784,44578,0
3: 1.88,63.1,0,16,152,13824,38925,0
4: 4.05,106,1,36,404,13968,53602,0
5: 3.64,198,0,29,276,45784,46609,0
6: 0.00,0.00,0,0,0,0,0,0

n=20
1: 5.37,209,1,44,449,42880,61169,0
2: 3.45,139,0,30,300,31744,47876,0
3: 2.05,85.4,0,18,191,24576,40393,0
4: 4.55,133,1,42,448,31744,57253,0
5: 3.73,141,0,32,336,31744,49605,0
6: 0.00,0.00,0,0,0,0,0,0

n=24
1: 5.41,241,1,44,430,66816,61880,0
2: 3.98,225,0,30,294,66816,49771,0
3: 1.89,43.5,0,18,192,7936,41070,0
4: 4.96,226,1,42,432,66816,59481,0
5: 4.20,226,1,32,332,66816,51326,0
6: 0.00,0.00,0,0,0,0,0,0

n=28
1: 7.27,424,1,48,499,106904,62698,0
2: 4.37,180,1,35,388,41368,52005,0
3: 2.34,74.7,0,19,198,14048,41758,0
4: 6.86,330,1,48,461,61440,61403,0
5: 4.58,181,1,38,396,41368,53506,0
6: 0.00,0.00,0,0,0,0,0,0

n=32
1: 7.99,802,1,50,467,245760,63450,0
2: 7.73,1.00e+03,1,37,350,311296,53600,0
3: 2.36,83.7,0,20,226,16384,42182,0
4: 6.23,251,1,52,502,49152,62968,0
5: 7.94,1.00e+03,1,39,366,311296,54999,0
6: 0.00,0.00,0,0,0,0,0,0
Conclusions
Dot3 (dotK with K=3) is perfect on this data. The algorithm effectively
handles the ill-conditioned dot product as the final summation reorders the
round-off errors.
The LinearCombination method can be made approximately 30% more accurate
(measured by a drop in mean, median and 99 percentile) by switching to Dekker’s
split algorithm.
The Dekker-split variant of LinearCombination outperforms the dot2s and dot2
algorithms from the Ogita paper. All these methods do a standard precision
summation of the round-off terms. Dot2s combines alternating product
round-off and sum round-off in a single term. Dot2 sums all the product
round-off and then continues to add the summation round-off.
LinearCombination keeps these separate and thus starts summing the sum
round-off from zero. In the case where the total sum fluctuates around zero
this is an advantage and the method has better accuracy.
Dekker's dot product outperforms all the other methods that use a summation
to 2-fold precision. This method requires a (|a| < |b|) comparison for each
summation and 1 additional float addition over LinearCombination.value. They
share 7 multiplies + 10 adds per split product and 8 adds per split sum, i.e.
25n flops for length n as per Ogita et al, and so Dekker's method is 29n
flops (assuming an absolute float compare is 3 flops). Ogita et al state
that, counting the compare, Dekker's two-sum addition requires 6 flops but is
in practice up to 50% slower than the method from Knuth, which requires 6
flops without branching. I have not verified this, but with a modern
processor the branch prediction and parallel computation per cycle may change
this (the paper is from 2005).
The dot3 algorithm handles this test perfectly. I think there is a case for
including it in LinearCombination when a higher accuracy is required without
switching to BigDecimal. Here is an example of the code for dotK. I did not
include the split, product and sum methods for brevity (minimal sketches of
these helpers follow the code):
/**
 * Dot product to K-fold precision.
 *
 * @param a Factors.
 * @param b Factors.
 * @param k The precision.
 * @return \( \sum_i a_i b_i \).
 * @throws IllegalArgumentException if the sizes of the arrays are different.
 */
public static double dotK(double[] a, double[] b, int k) {
    if (a.length != b.length) {
        throw new IllegalArgumentException("Dimension mismatch: " + a.length + " != " + b.length);
    }
    final int len = a.length;
    if (len == 1) {
        // Revert to scalar multiplication.
        return a[0] * b[0];
    }
    // Store all round-off parts.
    final double[] r = new double[len * 2];
    // p is the standard scalar product sum initialised with the first product.
    double p = a[0] * b[0];
    r[0] = productLow(a[0], b[0], p);
    // Remaining split products added to the current sum; round-off stored.
    for (int i = 1; i < len; i++) {
        final double h = a[i] * b[i];
        r[i] = productLow(a[i], b[i], h);
        final double x = p + h;
        r[i + len - 1] = sumLow(p, h, x);
        p = x;
    }
    if (Double.isNaN(p)) {
        // Either we have split infinite numbers or some coefficients were NaNs;
        // return the naive implementation for the IEEE754 result.
        return p;
    }
    // Sum the round-off with the standard sum as the final component.
    r[r.length - 1] = p;
    return sumK(r, k - 1);
}

/**
 * Sum to K-fold precision.
 *
 * @param p Data to sum.
 * @param k The precision.
 * @return the sum
 */
private static double sumK(double[] p, int k) {
    // k=1 will skip the vector transformation and sum in standard precision.
    for (int i = 1; i < k; i++) {
        vecSum(p);
    }
    double sum = 0;
    for (final double pi : p) {
        sum += pi;
    }
    return sum;
}

/**
 * Error-free vector transformation for summation.
 *
 * @param p Data.
 */
private static void vecSum(double[] p) {
    for (int i = 1; i < p.length; i++) {
        final double x = p[i] + p[i - 1];
        p[i - 1] = sumLow(p[i], p[i - 1], x);
        p[i] = x;
    }
}
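For completeness, here are minimal sketches of the omitted helpers: the
unguarded Dekker split, Dekker's two-product round-off, and Knuth's two-sum
round-off (as sketched earlier):

// Dekker's split: both parts fit in 26 bits of significand.
private static double highPart(double value) {
    final double c = (0x1.0p27 + 1) * value;
    return c - (c - value);
}

// Round-off of a * b given x = a * b (Dekker's two-product).
private static double productLow(double a, double b, double x) {
    final double aHi = highPart(a);
    final double aLo = a - aHi;
    final double bHi = highPart(b);
    final double bLo = b - bHi;
    return aLo * bLo - (((x - aHi * bHi) - aLo * bHi) - aHi * bLo);
}

// Round-off of a + b given x = a + b (Knuth's two-sum).
private static double sumLow(double a, double b, double x) {
    final double bVirtual = x - a;
    return (a - (x - bVirtual)) + (b - bVirtual);
}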
It is possible to efficiently unroll this for small vectors as has been done in
the class already. I did this in the Complex class while testing and the sumK
part looks like this:
// Cascade summation of the terms p_i
s = p0 + p1;
p0 = sumLow(p0, p1, s);
p1 = s;
s += p2;
p1 = sumLow(p1, p2, s);
p2 = s;
s += p3;
p2 = sumLow(p2, p3, s);
p3 = s;
Etc …
return p0 + p1 + p2 + p3 + ...
For discussion:
1. Leave LinearCombination alone.
2. Update LinearCombination to use Dekker's split algorithm and leave the
rest unchanged. The split algorithm must be guarded against overflow. This
adds two float comparison operations in extraction of the split part:
if (value > SAFE_UPPER || value < -SAFE_UPPER) {
    // handle overflow
    return …;
}
// normal conversion
final double c = MULTIPLIER * value;
return c - (c - value);
In the nominal case this condition is false and the rest of the split should
be the same speed as the current conversion to long bits, masking, and
conversion back to a double.
The change would improve accuracy by making the split multiply exact: with
the 27-bit low part produced by the masking approach, the product of the low
parts can lose its final round-off bit. A fuller sketch of the guarded split
is below.
I note that cm4.util.FastMathCalc uses this method but the factor is (2^30) and
not the correct (2^27 + 1).
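Here is a fuller sketch of the guarded split; SAFE_UPPER and the scaling
strategy are assumptions (scaling by a power of two is exact, so the split is
preserved):

private static final double MULTIPLIER = 0x1.0p27 + 1;
// Assumed bound: above this, MULTIPLIER * value would overflow.
private static final double SAFE_UPPER = 0x1.0p996;

private static double highPart(double value) {
    if (value > SAFE_UPPER || value < -SAFE_UPPER) {
        // Scale down, split, scale back up.
        final double x = value * 0x1.0p-30;
        final double c = MULTIPLIER * x;
        return (c - (c - x)) * 0x1.0p30;
    }
    final double c = MULTIPLIER * value;
    return c - (c - value);
}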
3. Add a method value(double[] a, double[] b, int k) with k-fold precision to
complement the current 2-fold precision value(double[] a, double[] b).
4. Switch to Dekker's dot product algorithm. However I have no error analysis
of whether this is comparable on different test sets; it is just better here.
There is a MATLAB script in Ogita et al to generate ill-conditioned vectors
for dot products. This may be worth investigating. For now I would not
recommend switching to Dekker's dot product unless there is evidence it is
generally applicable.
5. Add methods value3(…) that use 3-fold precision and unroll the loops for
small vectors.
I think that option 2 is preferred as all split multiply methods should use the
correct split from Dekker.
I would vote for option 3 to add a higher precision version if a user desires.
Since this may be 2x slower it may not be the best option for the default.
I would opt to implement option 5 and test it with JMH to see if dot3 is 2.2x
slower than dot2 in the JVM.
Thus most of this can be put into a JMH module and tested against the current
class to see the performance effect.
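A sketch of the kind of JMH benchmark I have in mind (the class and the dotK
hook are hypothetical):

import java.util.SplittableRandom;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class DotProductBenchmark {
    @Param({"4", "8", "16", "32"})
    private int n;
    private double[] a;
    private double[] b;

    @Setup
    public void setup() {
        final SplittableRandom rng = new SplittableRandom();
        a = rng.doubles(n, -1, 1).toArray();
        b = rng.doubles(n, -1, 1).toArray();
    }

    @Benchmark
    public double dot2() {
        return LinearCombination.value(a, b);
    }

    @Benchmark
    public double dot3() {
        return dotK(a, b, 3);
    }
}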
The type of changes that end up in LinearCombination comes down to whether
you want precision over speed.
Opinions welcome.
Alex
[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.1547
[2] https://doi.org/10.1007/BF01397083
[3] http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps
[4] https://en.wikipedia.org/wiki/Kahan_summation_algorithm#Further_enhancements