I have been investigating the accuracy of the LinearCombination class,
prompted by using it for high precision computations in the Complex class
where it did not achieve the desired precision.
The basis of the class is to compute dot products: sum_i a_i*b_i. It does
this by splitting each number into a two-part higher precision
representation, multiplying the parts, and using extra additions to track the
round-off of the standard summation. The accumulated round-off is added back
at the end.
The case where it fails is a linear combination whose total sum is close to
0 but whose intermediate sum is far enough from zero that there is a large
difference between the exponents of the summed terms and the final result.
This case is important for Complex numbers, which require computing
log1p(x^2+y^2-1) when x^2+y^2 is close to 1: log(1) would be ~zero, yet the
true logarithm is representable to very high precision.
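For illustration, the kind of call involved looks like this (a sketch; the
exact usage in Complex differs and the 6-argument overload shown is an
assumption):

// Evaluate x^2 + y^2 - 1 as a single dot product so the round-off of
// each product is retained; a naive evaluation loses all significance
// when (x, y) is near the unit circle.
final double s = LinearCombination.value(x, x, y, y, 1, -1);
final double result = Math.log1p(s);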
LinearCombination has a few things that could be changed. However I note that
the class has been written to avoid conditional statements on magnitude, and
so far I have not done a speed analysis of the implications of changes. I can
only estimate based on the assumption that 1 conditional (|x| < |y|) costs
about 3 floating point operations (flops). This is a figure used by Ogita et
al (2005) in the paper upon which LinearCombination is based [1].
For reference I have also referred to Dekker (1971) [2] and Shewchuk (1997)
[3]. Dekker is widely cited as the originator of the split product and split
addition, where a double with a 53-bit significand is split into two doubles
of 26 bits each (1 bit is moved to the sign component) and the terms
multiplied separately with exact precision. Shewchuk's paper builds on the
2-part split to produce algorithms that express a double as an expansion of
n parts. In either case the sum of the expansion is the original number
without loss of accuracy. Dekker's split addition requires knowing which part
is larger in magnitude. An alternative is an algorithm from Knuth that uses
more flops but does not require an ordering comparison.
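For reference, here are sketches of the two standard error-free additions,
where sum is assumed to be the rounded result of a + b:

// Dekker's version: exact only when |a| >= |b|, so the caller must order
// the arguments (or branch). 3 flops.
static double fastSumLow(double a, double b, double sum) {
    return b - (sum - a);
}

// Knuth's version: no ordering requirement. 6 flops, branch free.
static double sumLow(double a, double b, double sum) {
    final double bVirtual = sum - a;
    return (a - (sum - bVirtual)) + (b - bVirtual);
}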
The algorithm in LinearCombination is an implementation of dot2 from Ogita et
al (algorithm 5.3). There is a subtle difference in that the original dot2
algorithm sums the round-off from the products and the round-off from the
product summation together. The method in LinearCombination sums them
separately (using an extra variable) and adds them at the end. This actually
increases precision in my testing as the round-offs are on a different scale.
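To make the difference concrete, here is a sketch of the dot2 accumulation
(sumLow is Knuth's two-sum above; productLow is Dekker's two-product
round-off, sketched later in this mail):

double p = a[0] * b[0];
double s = productLow(a[0], b[0], p);
for (int i = 1; i < a.length; i++) {
    final double h = a[i] * b[i];
    final double x = p + h;
    // dot2: a single accumulator receives both round-off streams
    s += sumLow(p, h, x) + productLow(a[i], b[i], h);
    p = x;
}
return p + s;

The LinearCombination variant instead uses two accumulators,
s1 += productLow(...) and s2 += sumLow(...), and returns p + (s1 + s2).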
Here are the things I have tried changing:
1. Using Dekker's algorithm to split the number. As stated earlier this
extracts two 26-bit mantissas from a 53-bit mantissa (in IEEE 754 the leading
bit in front of the 52-bit stored mantissa is assumed to be 1). This is done
by multiplication by 2^s + 1 with s = ceil(53/2) = 27:
big = (2^s + 1) * a
a_hi = big - (big - a)
The extra bit of precision is carried into the sign of the low part of the
split number.
This is in contrast to the method in LinearCombination that uses a simple
mask on the long representation to obtain the a_hi part with 26 bits; the
lower part then holds up to 27 bits.
The advantage of Dekker's method is that it produces 2 parts with 26 bits in
the mantissa that can be multiplied exactly. The disadvantage is the
potential for overflow.
It also appropriately handles very small sub-normal numbers, which under the
current method would be masked to a 0 high part with all the bits in the low
part. That has knock-on effects on the split multiplication, which wants the
high part to be larger. A sketch of both splits follows.
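// Dekker's split: multiply by 2^27 + 1. Both parts fit in 26 bits of
// significand; the extra bit is carried in the sign of the low part.
// May overflow for large values.
static double highPartDekker(double value) {
    final double c = (0x1.0p27 + 1) * value;
    return c - (c - value);
}

// Mask-based split (the current approach): zero the low 27 bits of the
// stored mantissa. The high part has 26 bits; the low part up to 27.
static double highPartMask(double value) {
    return Double.longBitsToDouble(Double.doubleToRawLongBits(value) & ((-1L) << 27));
}

In both cases the low part is obtained as value - highPart(value).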
2. Using Dekker's dot product algorithm. This maintains a running split sum
of the individual products. It requires an ordering comparison per summation
to ensure the error term is appropriately maintained (see the sketch below).
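A sketch of the accumulation, following Dekker's add2 for the running split
sum (my reconstruction; the code I tested may differ in detail):

double hi = 0;
double lo = 0;
for (int i = 0; i < a.length; i++) {
    final double ph = a[i] * b[i];
    final double pl = productLow(a[i], b[i], ph);
    // Ordered two-sum of the high parts: this is the per-summation branch.
    final double s = hi + ph;
    final double e = Math.abs(hi) >= Math.abs(ph) ? ph - (s - hi) : hi - (s - ph);
    // Fold in the low-order terms and renormalise the split sum.
    final double w = e + (lo + pl);
    hi = s + w;
    lo = w - (hi - s);
}
return hi + lo;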
3. A variant with higher order summation using 3 doubles to represent the
expanded split sum, i.e. a Klein summation [4] of the running sum (sketched
below). Although this implementation was better, it was not significantly
better and I will not show results.
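For reference, a plain-array version of Klein's second-order Kahan-Babuska
summation [4]; my variant applied this style of carry tracking to the running
sum of the split products:

static double kleinSum(double[] x) {
    double s = 0;
    double cs = 0;
    double ccs = 0;
    for (final double xi : x) {
        // First-order carry of the main sum.
        double t = s + xi;
        final double c = Math.abs(s) >= Math.abs(xi) ? (s - t) + xi : (xi - t) + s;
        s = t;
        // Second-order carry of the carry sum.
        t = cs + c;
        final double cc = Math.abs(cs) >= Math.abs(c) ? (cs - t) + c : (c - t) + cs;
        cs = t;
        ccs += cc;
    }
    return s + cs + ccs;
}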
4. Using the dotK algorithm (algorithm 5.10) of Ogita et al (full code later
in this mail). This method stores all the round-off parts from the products
and sums in an array. If simply summed at the end it will match dot2 (K=2).
When K>2, repeated passes over the array apply a split sum to create an
error-free transformation of the data, which is then summed. The
transformation distils the data so that it is less affected by differences in
the scale of each part to be summed.
The paper by Ogita et al states that sum3 has approximately 2.2x the run-time
of sum2. The error is that of a K-fold working precision summation; thus the
sum is to 3-fold precision versus 2-fold precision.
Results
I created increasingly large arrays composed of the quartet of numbers [a, b,
c, d] with the condition that:
a^2 + b^2 = 1
c^2 + d^2 = 1
These are multiplied as the dot product [a, b, c, d] * [a, b, -c, -d], so
when summed the result should be 0; in practice there is a lot of
floating-point error. I compute the correct result using BigDecimal. The
scale of the BigDecimal is clipped to -324 so that it does not go under
Double.MIN_VALUE.
This never happens in the test conditions AFAIK but it creates a realistic
limit for the computation. I measure the error using units of least precision
(ulps).
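One way to generate the quartets (an assumption; my harness may differ) is to
take points on the unit circle from random angles:

final java.util.SplittableRandom rng = new java.util.SplittableRandom();
final double t1 = rng.nextDouble() * 2 * Math.PI;
final double t2 = rng.nextDouble() * 2 * Math.PI;
final double[] u = {Math.cos(t1), Math.sin(t1), Math.cos(t2), Math.sin(t2)};
final double[] v = {u[0], u[1], -u[2], -u[3]};
// dot(u, v) = a^2 + b^2 - (c^2 + d^2): ~0 but not exactly 0 in the doubles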
In the following results each block is headed by the total length of the data
(n). Each numbered line within a block is one method (numbered as in the
method list below), giving:
Mean ulps
Standard deviation ulps
Median ulps
Percentile 99 ulps
Percentile 99.9 ulps
Percentile 100 ulps (i.e. max error)
Error count (i.e. when ulps != 0)
Sign error (when the result has the wrong sign)
For all metrics lower is better. 100000 samples were run.
The methods are:
1. LinearCombination.value
2. LinearCombination.value modified to use Dekker’s split
3. Dekker’s dot product
4. Dot2s as described in Ogita et al
5. DotK with K=2
6. DotK with K=3
n=4
1: 2.76,50.3,0,32,272,8220,40510,0
2: 1.17,28.7,0,15,128,5888,21154,0
3: 1.05,37.4,0,9,118,8192,19680,0
4: 1.39,39.0,0,16,128,6144,23002,0
5: 1.21,28.9,0,16,128,5888,21961,0
6: 0.00,0.00,0,0,0,0,0,0

n=8
1: 9.01,1.66e+03,1,37,352,524288,53469,0
2: 1.96,57.2,0,18,168,12288,33874,0
3: 1.46,55.9,0,15,128,16384,31287,0
4: 2.32,58.1,0,23,240,12288,40110,0
5: 2.42,118,0,20,192,32768,35870,0
6: 0.00,0.00,0,0,0,0,0,0

n=12
1: 5.04,280,1,40,426,73728,58035,0
2: 2.53,88.2,0,23,256,23032,40911,0
3: 1.96,96.3,0,16,160,23032,36646,0
4: 3.36,103,0,31,272,15872,49167,0
5: 2.70,89.2,0,24,266,23032,43062,0
6: 0.00,0.00,0,0,0,0,0,0

n=16
1: 4.93,207,1,40,428,45784,59624,0
2: 3.44,198,0,26,248,45784,44578,0
3: 1.88,63.1,0,16,152,13824,38925,0
4: 4.05,106,1,36,404,13968,53602,0
5: 3.64,198,0,29,276,45784,46609,0
6: 0.00,0.00,0,0,0,0,0,0

n=20
1: 5.37,209,1,44,449,42880,61169,0
2: 3.45,139,0,30,300,31744,47876,0
3: 2.05,85.4,0,18,191,24576,40393,0
4: 4.55,133,1,42,448,31744,57253,0
5: 3.73,141,0,32,336,31744,49605,0
6: 0.00,0.00,0,0,0,0,0,0

n=24
1: 5.41,241,1,44,430,66816,61880,0
2: 3.98,225,0,30,294,66816,49771,0
3: 1.89,43.5,0,18,192,7936,41070,0
4: 4.96,226,1,42,432,66816,59481,0
5: 4.20,226,1,32,332,66816,51326,0
6: 0.00,0.00,0,0,0,0,0,0

n=28
1: 7.27,424,1,48,499,106904,62698,0
2: 4.37,180,1,35,388,41368,52005,0
3: 2.34,74.7,0,19,198,14048,41758,0
4: 6.86,330,1,48,461,61440,61403,0
5: 4.58,181,1,38,396,41368,53506,0
6: 0.00,0.00,0,0,0,0,0,0

n=32
1: 7.99,802,1,50,467,245760,63450,0
2: 7.73,1.00e+03,1,37,350,311296,53600,0
3: 2.36,83.7,0,20,226,16384,42182,0
4: 6.23,251,1,52,502,49152,62968,0
5: 7.94,1.00e+03,1,39,366,311296,54999,0
6: 0.00,0.00,0,0,0,0,0,0
Conclusions
Dot3 (dotK with K=3) is perfect on this data. The algorithm effectively
handles the ill-conditioned dot product as the final summation reorders the
round-off errors.
The LinearCombination method can be made approximately 30% more accurate
(measured by a drop in mean, median and 99 percentile) by switching to Dekker’s
split algorithm.
The Dekker-split variant of LinearCombination outperforms the dot2s and dot2
algorithms from the Ogita paper. All these methods do a standard precision
summation of the round-off terms. Dot2s combines alternating product
round-off and sum round-off in a single term. Dot2 sums all the product
round-off and then continues to add the summation round-off.
LinearCombination keeps these separate and thus starts summing the sum
round-off from zero. In the case where the total sum fluctuates around zero
this is an advantage and the method has better accuracy.
Dekker's dot product outperforms all the other methods that use a summation
to 2-fold precision. This method requires a (|a| < |b|) comparison for each
summation and 1 additional float addition over LinearCombination.value. They
share 7 multiplies + 10 adds per split product and 8 adds per split sum, i.e.
25n flops for length n as per Ogita et al, and so Dekker's method is 29n
flops (assuming an absolute float compare is 3 flops). Ogita et al state
that, counting the compare, Dekker's two-sum addition requires 6 flops but is
in practice up to 50% slower than the method from Knuth, which requires 6
flops without branching. I have not verified this, but with a modern
processor the branch prediction and parallel computation per cycle may change
this (the paper is from 2005).
The dot3 algorithm handles this test perfectly. I think there is a case for
including it in LinearCombination when a higher accuracy is required without
switching to BigDecimal. Here is an example of the code for dotK. I did not
include the split, product and sum methods for brevity (minimal sketches of
these helpers follow the code):
/**
 * Dot product to K-fold precision.
 *
 * @param a Factors.
 * @param b Factors.
 * @param k The precision.
 * @return \( \sum_i a_i b_i \).
 * @throws IllegalArgumentException if the sizes of the arrays are different.
 */
public static double dotK(double[] a, double[] b, int k) {
    if (a.length != b.length) {
        throw new IllegalArgumentException("Dimension mismatch: " + a.length + " != " + b.length);
    }
    final int len = a.length;
    if (len == 1) {
        // Revert to scalar multiplication.
        return a[0] * b[0];
    }
    // Store all round-off parts.
    final double[] r = new double[len * 2];
    // p is the standard scalar product sum initialised with the first product.
    double p = a[0] * b[0];
    r[0] = productLow(a[0], b[0], p);
    // Remaining split products added to the current sum; round-off stored.
    for (int i = 1; i < len; i++) {
        final double h = a[i] * b[i];
        r[i] = productLow(a[i], b[i], h);
        final double x = p + h;
        r[i + len - 1] = sumLow(p, h, x);
        p = x;
    }
    if (Double.isNaN(p)) {
        // Either we have split infinite numbers or some coefficients were NaNs;
        // return the naive implementation for the IEEE754 result.
        return p;
    }
    // Sum the round-off with the standard sum as the final component.
    r[r.length - 1] = p;
    return sumK(r, k - 1);
}

/**
 * Sum to K-fold precision.
 *
 * @param p Data to sum.
 * @param k The precision.
 * @return the sum
 */
private static double sumK(double[] p, int k) {
    // k=1 will skip the vector transformation and sum in standard precision.
    for (int i = 1; i < k; i++) {
        vecSum(p);
    }
    double sum = 0;
    for (final double pi : p) {
        sum += pi;
    }
    return sum;
}

/**
 * Error-free vector transformation for summation.
 *
 * @param p Data.
 */
private static void vecSum(double[] p) {
    for (int i = 1; i < p.length; i++) {
        final double x = p[i] + p[i - 1];
        p[i - 1] = sumLow(p[i], p[i - 1], x);
        p[i] = x;
    }
}
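For completeness, here are minimal sketches of the omitted helpers: the
unguarded Dekker split, Dekker's two-product round-off, and Knuth's two-sum
round-off (as sketched earlier):

// Dekker's split: both parts fit in 26 bits of significand.
private static double highPart(double value) {
    final double c = (0x1.0p27 + 1) * value;
    return c - (c - value);
}

// Round-off of a * b given x = a * b (Dekker's two-product).
private static double productLow(double a, double b, double x) {
    final double aHi = highPart(a);
    final double aLo = a - aHi;
    final double bHi = highPart(b);
    final double bLo = b - bHi;
    return aLo * bLo - (((x - aHi * bHi) - aLo * bHi) - aHi * bLo);
}

// Round-off of a + b given x = a + b (Knuth's two-sum).
private static double sumLow(double a, double b, double x) {
    final double bVirtual = x - a;
    return (a - (x - bVirtual)) + (b - bVirtual);
}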
It is possible to efficiently unroll this for small vectors as has been done in
the class already. I did this in the Complex class while testing and the sumK
part looks like this:
// Cascade summation of the terms p_i
s = p0 + p1;
p0 = sumLow(p0, p1, s);
p1 = s;
s += p2;
p1 = sumLow(p1, p2, s);
p2 = s;
s += p3;
p2 = sumLow(p2, p3, s);
p3 = s;
Etc …
return p0 + p1 + p2 + p3 + ...
For discussion:
1. Leave LinearCombination alone.
2. Update LinearCombination to use Dekker's split algorithm and leave the
rest unchanged. The split algorithm must be guarded against overflow. This
adds two float comparison operations in extraction of the split part:
if (value > SAFE_UPPER || value < -SAFE_UPPER) {
    // handle overflow
    return …;
}
// normal conversion
final double c = MULTIPLIER * value;
return c - (c - value);
In the nominal case this condition is false and the rest of the split should
be the same speed as the current conversion to long bits, masking, and
conversion back to a double.
The change would improve accuracy by making the split multiply exact: with
the 27-bit low part produced by the masking approach, the product of the low
parts can lose its final round-off bit. A fuller sketch of the guarded split
is below.
I note that cm4.util.FastMathCalc uses this method but the factor is (2^30) and
not the correct (2^27 + 1).
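Here is a fuller sketch of the guarded split; SAFE_UPPER and the scaling
strategy are assumptions (scaling by a power of two is exact, so the split is
preserved):

private static final double MULTIPLIER = 0x1.0p27 + 1;
// Assumed bound: above this, MULTIPLIER * value would overflow.
private static final double SAFE_UPPER = 0x1.0p996;

private static double highPart(double value) {
    if (value > SAFE_UPPER || value < -SAFE_UPPER) {
        // Scale down, split, scale back up.
        final double x = value * 0x1.0p-30;
        final double c = MULTIPLIER * x;
        return (c - (c - x)) * 0x1.0p30;
    }
    final double c = MULTIPLIER * value;
    return c - (c - value);
}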
3. Add a method value(double[] a, double[] b, int k) with k-fold precision to
complement the current 2-fold precision value(double[] a, double[] b).
4. Switch to Dekker's dot product algorithm. However I have no error analysis
of whether this is comparable on different test sets; it is just better here.
There is a MATLAB script in Ogita et al to generate ill-conditioned vectors
for dot products. This may be worth investigating. For now I would not
recommend switching to Dekker's dot product unless there is evidence it is
generally applicable.
5. Add methods value3(…) that use 3-fold precision and unroll the loops for
small vectors.
I think that option 2 is preferred as all split multiply methods should use the
correct split from Dekker.
I would vote for option 3 to add a higher precision version if a user desires.
Since this may be 2x slower it may not be the best option for the default.
I would opt to implement option 5 and test it with JMH to see if dot3 is 2.2x
slower than dot2 in the JVM.
Thus most of this can be put into a JMH module and tested against the current
class to see the performance effect.
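A sketch of the kind of JMH benchmark I have in mind (the class and the dotK
hook are hypothetical):

import java.util.SplittableRandom;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class DotProductBenchmark {
    @Param({"4", "8", "16", "32"})
    private int n;
    private double[] a;
    private double[] b;

    @Setup
    public void setup() {
        final SplittableRandom rng = new SplittableRandom();
        a = rng.doubles(n, -1, 1).toArray();
        b = rng.doubles(n, -1, 1).toArray();
    }

    @Benchmark
    public double dot2() {
        return LinearCombination.value(a, b);
    }

    @Benchmark
    public double dot3() {
        return dotK(a, b, 3);
    }
}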
The type of changes that end up in LinearCombination comes down to whether
you want precision over speed.
Opinions welcome.
Alex
[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.1547
[2] https://doi.org/10.1007/BF01397083
[3] http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps
[4] https://en.wikipedia.org/wiki/Kahan_summation_algorithm#Further_enhancements