I have been investigating the accuracy of the LinearCombination class. This was prompted by using it to perform high-precision computations in the Complex class, where it did not achieve the desired precision.
The basis of the class is to compute dot products: sum a_i * b_i. It does this by splitting the numbers into a higher precision representation, multiplying the parts, and using additions to track the round-off of the standard summation. The accumulated round-off is then added at the end. The case where it fails is a linear combination whose total sum is close to 0 but whose intermediate sum is far enough from zero that there is a large difference between the exponents of the summed terms and of the final result. This case is important for Complex numbers, which require a computation of log1p(x^2+y^2-1) when x^2+y^2 is close to 1: log(1) would be ~zero while the true logarithm is representable to very high precision.

There are a few things in LinearCombination that could be changed. However, I note that the class has been written to avoid conditional statements on magnitude, and so far I have not done a speed analysis of the implications of changes. I can only estimate based on a guess that 1 conditional (|x| < |y|) is equal to 3 floating-point operations (flops). This is a figure used by Ogita et al (2005) in the paper upon which LinearCombination is based [1]. For reference I have also referred to Dekker (1971) [2] and Shewchuk (1997) [3]. Dekker is widely credited as the originator of the split product and split addition, where a double with a 53-bit mantissa is split into two doubles of 26 bits each (1 bit is moved to the sign component) and the parts are multiplied separately with exact precision. Shewchuk's paper expands on 2-part splits to produce algorithms that express a double as an expansion of n parts. In either case the sum of the expansion is the original number without loss of accuracy. Dekker's split addition requires knowing which operand is larger in magnitude. An alternative is an algorithm from Knuth that uses more flops but does not require ordering comparisons.
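For concreteness, here is a minimal sketch of Dekker's split and the exact product round-off it enables. The MULTIPLIER and SAFE_UPPER names match the guarded split discussed under option 2 below; the SAFE_UPPER threshold and the scaling fallback are illustrative assumptions (any guard that prevents (2^27 + 1) * value overflowing will do):

    /** Dekker's split multiplier: 2^s + 1 with s = ceil(53 / 2) = 27. */
    private static final double MULTIPLIER = 0x1.0p27 + 1;
    /** Magnitude above which MULTIPLIER * value may overflow (assumed threshold). */
    private static final double SAFE_UPPER = 0x1.0p996;

    /** Extract the high part of the 53-bit mantissa; the low part is (value - hi). */
    private static double highPart(double value) {
        if (value > SAFE_UPPER || value < -SAFE_UPPER) {
            // Avoid overflow: scale down by a power of 2, split, scale back (exact).
            final double x = value * 0x1.0p-30;
            final double c = MULTIPLIER * x;
            return (c - (c - x)) * 0x1.0p30;
        }
        final double c = MULTIPLIER * value;
        return c - (c - value);
    }

    /** Round-off of the product x = a * b, computed exactly via the split parts. */
    private static double productLow(double a, double b, double x) {
        final double aHi = highPart(a);
        final double aLo = a - aHi;
        final double bHi = highPart(b);
        final double bLo = b - bHi;
        return aLo * bLo - (((x - aHi * bHi) - aLo * bHi) - aHi * bLo);
    }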
The algorithm in LinearCombination is an implementation of dot2 from Ogita et al (algorithm 5.3). There is a subtle difference in that the original dot2 algorithm sums the round-off from the products and the round-off from the product summation together. The method in LinearCombination sums them separately (using an extra variable) and adds them at the end. This actually increases precision in my testing, as the two sets of round-offs are on different scales.

Here are the things I have tried changing:

1. Using Dekker's algorithm to split the number (as sketched above). As stated earlier, this extracts two 26-bit mantissas from a 53-bit mantissa (in IEEE 754 the leading bit in front of the 52-bit stored mantissa is assumed to be 1). This is done by multiplication by 2^s + 1 with s = ceil(53/2) = 27:

   big = (2^s + 1) * a
   a_hi = big - (big - a)

The extra bit of precision is carried into the sign of the low part of the split number. This is in contrast to the method in LinearCombination, which uses a simple mask on the long representation to obtain the a_hi part in 26 bits, leaving the lower part with 27 bits. The advantage of Dekker's method is that it produces 2 parts with 26-bit mantissas that can be multiplied exactly. The disadvantage is the potential for overflow. It also appropriately handles very small sub-normal numbers, which the current masking method would turn into a zero high part with all the bits in the low part. This has knock-on effects on the split multiplication, which wants the high part to be larger.

2. Using Dekker's dot product algorithm. This maintains a running split sum of the individual products. It requires an ordering comparison per summation to ensure the error term is appropriately maintained.

3. A variant with higher-order summation using 3 doubles to represent the expanded split sum. This is effectively a Klein summation [4] of the sum (a standalone sketch is shown after this list). Although this implementation was better, it was not significantly better and I will not show results.

4. Using the dotK algorithm (algorithm 5.10) of Ogita et al. This method stores all the round-off parts from the products and sums in an array. If the array is simply summed at the end it will match dot2 (K=2). When K>2, repeated loops over the array are used with a split sum to create an error-free transformation of the data, which is then summed. The transformation has the effect that the data is distilled so that it is less affected by differences in the scale of each part to be summed. Ogita et al state that sum3 is approximately 2.2x the run-time of sum2. The error is that of a K-fold working precision summation; thus the sum is 3-fold precision versus 2-fold precision.
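For reference, here is a minimal standalone sketch of the Klein-style cascaded summation as described in [4], using the usual branch-based round-off extraction. The variant I tried applied the same idea to the running split sum inside the dot product rather than to a plain array:

    /** Sum the data using two cascaded correction terms (Klein summation [4]). */
    private static double kleinSum(double[] a) {
        double sum = 0;
        double cs = 0;  // first-order correction
        double ccs = 0; // second-order correction
        for (final double x : a) {
            double t = sum + x;
            final double c = Math.abs(sum) >= Math.abs(x)
                ? (sum - t) + x   // low part when sum dominates
                : (x - t) + sum;  // low part when x dominates
            sum = t;
            t = cs + c;
            final double cc = Math.abs(cs) >= Math.abs(c)
                ? (cs - t) + c
                : (c - t) + cs;
            cs = t;
            ccs += cc;
        }
        return sum + cs + ccs;
    }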
Results

I created increasingly large arrays composed of quartets of numbers [a, b, c, d] with the condition that:

   a^2 + b^2 = 1
   c^2 + d^2 = 1

These are multiplied as the dot product [a, b, c, d] * [a, b, -c, -d], so when summed the result should be 0. But there is a lot of floating-point error. I compute the correct result using BigDecimal. The scale of the BigDecimal is clipped to -324 so that the BigDecimal does not go under Double.MIN_VALUE. This never happens in the test conditions AFAIK, but it creates a realistic limit for the computation. I measure the error in units in the last place (ulps).

In the following results each block is headed by the total length of the data (n). Each method row then lists:

   Mean ulps
   Standard deviation ulps
   Median ulps
   Percentile 99 ulps
   Percentile 99.9 ulps
   Percentile 100 ulps (i.e. max error)
   Error count (i.e. when ulps != 0)
   Sign error count (when the result has the wrong sign)

For all metrics lower is better. 100000 samples were run. The methods are:

1. LinearCombination.value
2. LinearCombination.value modified to use Dekker's split
3. Dekker's dot product
4. Dot2s as described in Ogita et al
5. DotK with K=2
6. DotK with K=3

n = 4
  1: 2.76, 50.3, 0, 32, 272, 8220, 40510, 0
  2: 1.17, 28.7, 0, 15, 128, 5888, 21154, 0
  3: 1.05, 37.4, 0, 9, 118, 8192, 19680, 0
  4: 1.39, 39.0, 0, 16, 128, 6144, 23002, 0
  5: 1.21, 28.9, 0, 16, 128, 5888, 21961, 0
  6: 0.00, 0.00, 0, 0, 0, 0, 0, 0

n = 8
  1: 9.01, 1.66e+03, 1, 37, 352, 524288, 53469, 0
  2: 1.96, 57.2, 0, 18, 168, 12288, 33874, 0
  3: 1.46, 55.9, 0, 15, 128, 16384, 31287, 0
  4: 2.32, 58.1, 0, 23, 240, 12288, 40110, 0
  5: 2.42, 118, 0, 20, 192, 32768, 35870, 0
  6: 0.00, 0.00, 0, 0, 0, 0, 0, 0

n = 12
  1: 5.04, 280, 1, 40, 426, 73728, 58035, 0
  2: 2.53, 88.2, 0, 23, 256, 23032, 40911, 0
  3: 1.96, 96.3, 0, 16, 160, 23032, 36646, 0
  4: 3.36, 103, 0, 31, 272, 15872, 49167, 0
  5: 2.70, 89.2, 0, 24, 266, 23032, 43062, 0
  6: 0.00, 0.00, 0, 0, 0, 0, 0, 0

n = 16
  1: 4.93, 207, 1, 40, 428, 45784, 59624, 0
  2: 3.44, 198, 0, 26, 248, 45784, 44578, 0
  3: 1.88, 63.1, 0, 16, 152, 13824, 38925, 0
  4: 4.05, 106, 1, 36, 404, 13968, 53602, 0
  5: 3.64, 198, 0, 29, 276, 45784, 46609, 0
  6: 0.00, 0.00, 0, 0, 0, 0, 0, 0

n = 20
  1: 5.37, 209, 1, 44, 449, 42880, 61169, 0
  2: 3.45, 139, 0, 30, 300, 31744, 47876, 0
  3: 2.05, 85.4, 0, 18, 191, 24576, 40393, 0
  4: 4.55, 133, 1, 42, 448, 31744, 57253, 0
  5: 3.73, 141, 0, 32, 336, 31744, 49605, 0
  6: 0.00, 0.00, 0, 0, 0, 0, 0, 0

n = 24
  1: 5.41, 241, 1, 44, 430, 66816, 61880, 0
  2: 3.98, 225, 0, 30, 294, 66816, 49771, 0
  3: 1.89, 43.5, 0, 18, 192, 7936, 41070, 0
  4: 4.96, 226, 1, 42, 432, 66816, 59481, 0
  5: 4.20, 226, 1, 32, 332, 66816, 51326, 0
  6: 0.00, 0.00, 0, 0, 0, 0, 0, 0

n = 28
  1: 7.27, 424, 1, 48, 499, 106904, 62698, 0
  2: 4.37, 180, 1, 35, 388, 41368, 52005, 0
  3: 2.34, 74.7, 0, 19, 198, 14048, 41758, 0
  4: 6.86, 330, 1, 48, 461, 61440, 61403, 0
  5: 4.58, 181, 1, 38, 396, 41368, 53506, 0
  6: 0.00, 0.00, 0, 0, 0, 0, 0, 0

n = 32
  1: 7.99, 802, 1, 50, 467, 245760, 63450, 0
  2: 7.73, 1.00e+03, 1, 37, 350, 311296, 53600, 0
  3: 2.36, 83.7, 0, 20, 226, 16384, 42182, 0
  4: 6.23, 251, 1, 52, 502, 49152, 62968, 0
  5: 7.94, 1.00e+03, 1, 39, 366, 311296, 54999, 0
  6: 0.00, 0.00, 0, 0, 0, 0, 0, 0

Conclusions

Dot3 is perfect on this data. The algorithm effectively handles the poorly conditioned dot product as the final summation reorders the round-off errors.

The LinearCombination method can be made approximately 30% more accurate (measured by a drop in the mean, median and 99th percentile) by switching to Dekker's split algorithm.

The Dekker-split variant of LinearCombination outperforms the dot2s and dot2 algorithms from the Ogita paper. All of these methods do a standard precision summation of the round-off terms. Dot2s combines alternating product round-off and sum round-off in a single term. Dot2 sums all the product round-off and continues to add the summation round-off. LinearCombination keeps these separate and thus starts summing the sum round-off from zero. In the case where the total sum is fluctuating around zero this is an advantage and the method has better performance.

Dekker's dot product outperforms all the other methods that use a summation to 2-fold precision. This method requires a (|a| < |b|) comparison for each summation and 1 additional floating-point addition over LinearCombination.value. They share 7 multiply + 10 add per split product and 8 add per split sum, i.e. 25n flops for length n as per Ogita et al, so Dekker's method is 29n flops (assuming an absolute float compare is 3 flops). Ogita et al state that with the float compare Dekker's two-sum addition requires 6 flops, but it is up to 50% slower than the method from Knuth, which requires 6 flops without branching (both variants are sketched below). I have not verified this, but on a modern processor the branch prediction and parallel computation per cycle may change this (the paper is from 2005).

The dot3 algorithm is fine with this test. I think there is a case for including it in LinearCombination when a higher accuracy is required without switching to BigDecimal.
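For reference, here are minimal sketches of the two split-addition round-off computations being compared, assuming the sum s = a + b has already been computed. The branchless form is the style the class currently uses (it avoids magnitude conditionals); the method names are mine:

    /** Round-off of s = a + b (Knuth's two-sum: branch free). */
    private static double sumLow(double a, double b, double s) {
        final double bVirtual = s - a;
        return (a - (s - bVirtual)) + (b - bVirtual);
    }

    /** Round-off of s = a + b (Dekker's two-sum: fewer operations, needs ordering). */
    private static double fastSumLow(double a, double b, double s) {
        return Math.abs(a) < Math.abs(b)
            ? a - (s - b)   // |b| dominates
            : b - (s - a);  // |a| dominates
    }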
Here is an example of the code for dotK. I did not include the split, product and sum methods for brevity:

    /**
     * Dot product to K-fold precision.
     *
     * @param a Factors.
     * @param b Factors.
     * @param k The precision.
     * @return \( \sum_i a_i b_i \).
     * @throws IllegalArgumentException if the sizes of the arrays are different.
     */
    public static double dotK(double[] a, double[] b, int k) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Dimension mismatch: " + a.length + " != " + b.length);
        }
        final int len = a.length;
        if (len == 1) {
            // Revert to scalar multiplication.
            return a[0] * b[0];
        }
        // Store all round-off parts
        final double[] r = new double[len * 2];
        // p is the standard scalar product sum initialised with the first product
        double p = a[0] * b[0];
        r[0] = productLow(a[0], b[0], p);
        // Remaining split products added to the current sum and round-off stored
        for (int i = 1; i < len; i++) {
            final double h = a[i] * b[i];
            r[i] = productLow(a[i], b[i], h);
            final double x = p + h;
            r[i + len - 1] = sumLow(p, h, x);
            p = x;
        }
        if (Double.isNaN(p)) {
            // Either we have split infinite numbers or some coefficients were NaNs;
            // return the result of the naive sum for IEEE754 compliance
            return p;
        }
        // Sum the round-off with the standard sum as the final component
        r[r.length - 1] = p;
        return sumK(r, k - 1);
    }

    /**
     * Sum to K-fold precision.
     *
     * @param p Data to sum.
     * @param k The precision.
     * @return the sum
     */
    private static double sumK(double[] p, int k) {
        // k=1 will skip the vector transformation and sum in standard precision.
        for (int i = 1; i < k; i++) {
            vecSum(p);
        }
        double sum = 0;
        for (final double pi : p) {
            sum += pi;
        }
        return sum;
    }

    /**
     * Error free vector transformation for summation.
     *
     * @param p Data.
     */
    private static void vecSum(double[] p) {
        for (int i = 1; i < p.length; i++) {
            final double x = p[i] + p[i - 1];
            p[i - 1] = sumLow(p[i], p[i - 1], x);
            p[i] = x;
        }
    }

It is possible to efficiently unroll this for small vectors, as has been done in the class already. I did this in the Complex class while testing and the sumK part looks like this:

    // Cascade summation of the terms p_i
    s = p0 + p1;
    p0 = sumLow(p0, p1, s);
    p1 = s;
    s += p2;
    p1 = sumLow(p1, p2, s);
    p2 = s;
    s += p3;
    p2 = sumLow(p2, p3, s);
    p3 = s;
    // Etc ...
    return p0 + p1 + p2 + p3 + ...;
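As a quick sanity check (assuming the dotK method above is in scope), a single ill-conditioned quartet from the test construction can be evaluated directly; the angles are arbitrary:

    // One quartet on the unit circle: a^2 + b^2 = 1 and c^2 + d^2 = 1.
    final double t = 0.1234;
    final double u = 5.4321;
    final double[] x = {Math.cos(t), Math.sin(t), Math.cos(u), Math.sin(u)};
    final double[] y = {Math.cos(t), Math.sin(t), -Math.cos(u), -Math.sin(u)};
    // The exact dot product is ~0. K=2 is 2-fold precision (dot2);
    // K=3 returned the exact result for all lengths in the tests above.
    System.out.println(dotK(x, y, 2));
    System.out.println(dotK(x, y, 3));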
For discussion:

1. Leave LinearCombination alone.

2. Update LinearCombination to use Dekker's split algorithm and leave the rest unchanged. The split algorithm must be guarded against overflow. This adds two float comparison operations in the extraction of the split part:

   if (value > SAFE_UPPER || value < -SAFE_UPPER) {
       // handle overflow
       return …;
   }
   // normal conversion
   final double c = MULTIPLIER * value;
   return c - (c - value);

In the nominal case this condition is false and the rest of the split should be the same speed as the current conversion to long bits, masking and conversion back to a double. The change would improve the accuracy by ensuring the split multiply has no error due to loss of round-off bits from the final bit of the original mantissa. I note that cm4.util.FastMathCalc uses this method but the factor is (2^30) and not the correct (2^27 + 1).

3. Add a method value(double[] a, double[] b, int k) with k-fold precision to complement the current 2-fold precision value(double[] a, double[] b).

4. Switch to Dekker's dot product algorithm. However, I have no error analysis of whether this is comparable on different test sets; it is just better here. There is a MatLab script in Ogita et al to generate test vectors for multiplication. This may be worth investigating. For now I would not recommend switching to Dekker's dot product unless there is evidence it is generally applicable.

5. Add methods value3(…) that use 3-fold precision and unroll the loops for small vectors.

I think that option 2 is preferred as all split multiply methods should use the correct split from Dekker. I would vote for option 3 to add a higher precision version if a user desires one. Since this may be 2x slower it may not be the best option for the default. I would opt to implement option 5 and test it with JMH to see if dot3 is 2.2x slower than dot2 in the JVM. Thus most of this can be put into a JMH module and tested against the current class to see the performance effect. The type of changes that may end up in LinearCombination comes down to whether you want precision over speed.

Opinions welcome.

Alex

[1] Ogita et al (2005): http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.1547
[2] Dekker (1971): https://doi.org/10.1007/BF01397083
[3] Shewchuk (1997): http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps
[4] https://en.wikipedia.org/wiki/Kahan_summation_algorithm#Further_enhancements