Hi Alessandro,

A lot has been written about this. Google "precision of summing floating point values" and read the .pdfs on the first page for some analysis. Follow the citations for more.

Somewhere there is a paper that analyses the performance of different methods and suggests the optimal approach. I think the best choice is signal-dependent.


On 10/12/2012 11:32 AM, Alessandro Saccoia wrote:
It's going to be a sort of cumulative process that goes on in time,
so I won't necessarily know N in advance. If I had strong evidence
that I should prefer one method over the other, I could decide to
keep all the temporary X1, X2,… and recompute everything each time.
Performance and storage are not a strong concern in this case, but
the quality of the final and intermediate results is.

Dividing by N only when you need the mean sounds like a good idea; that way you won't be trashing the precision of each value prior to summing it.
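
Something like this, as a rough sketch in C++ (the names are just illustrative):

    // Keep the raw sum and the count; divide only when the mean is
    // actually needed, so no precision is spent on per-value divides.
    struct RunningMean {
        double sum = 0.0;
        long long n = 0;
        void add(double x) { sum += x; ++n; }
        double mean() const { return n ? sum / n : 0.0; }
    };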

To avoid throwing out any information you could use high-precision arithmetic. Did you say whether the signals are originally integers or floating point? If they're integers, can you keep the sum in a 64-bit int? Otherwise maybe a double. You can easily compute when you will start to lose precision in floating point, based on the range of the input and the number of elements in the sum (summing 2 values requires 1 extra bit of headroom, 4 values requires 2 bits, 8 values 3 bits, etc.). So for 1024 32-bit values you'll need 10 more bits to avoid any loss of precision due to truncation. There is also arbitrary-precision arithmetic if you don't want to throw any bits away, and there is something called "double-double", a software 128-bit floating point type that maybe isn't too expensive.
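
For example, if your samples really are 32-bit ints, a 64-bit accumulator keeps every addition exact (a sketch; the function name is made up):

    #include <cstddef>
    #include <cstdint>

    // Exact sum of 32-bit integer samples in a 64-bit accumulator:
    // 32 bits of headroom covers up to 2^32 summands with no loss.
    std::int64_t sumExact(const std::int32_t* x, std::size_t n)
    {
        std::int64_t acc = 0;
        for (std::size_t i = 0; i < n; ++i)
            acc += x[i];    // each addition is exact
        return acc;
    }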

One problem with floating point is that adding a large value and a small value will truncate the small value (first it needs to be shifted so it has the same exponent as the large value). You didn't say much about your values, but assuming that you're adding values distributed around a non-zero mean, the accumulator will increase in value as you add more values, so later summands will be truncated more than earlier ones.

One way to minimise this is to maintain multiple accumulators and only sum a certain number of values into each one (or sum into successive accumulators, kind of like a ring buffer of accumulators), then sum the accumulators together at the end; there's a sketch below. This reduces truncation effects, since each accumulator has a limited range (hence higher precision of summation), and when you sum the final accumulators together they will (hopefully) all have a similar range. A variation on this, if you know your signals have different magnitudes (e.g. you are summing both X and X^2), is to use separate accumulators for each magnitude class, since these are obviously going to have vastly different domains.
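
A rough sketch of the multiple-accumulator idea (the accumulator count is arbitrary and the names are made up):

    #include <cstddef>
    #include <vector>

    // Round-robin the values into several accumulators so each one
    // stays a factor numAccs smaller than a single running sum would,
    // which means later summands get truncated less. The partial sums
    // are combined only at the end, when they all have similar magnitude.
    double sumCascaded(const std::vector<double>& x, std::size_t numAccs = 16)
    {
        std::vector<double> acc(numAccs, 0.0);
        for (std::size_t i = 0; i < x.size(); ++i)
            acc[i % numAccs] += x[i];   // successive accumulators, ring-buffer style
        double total = 0.0;
        for (double a : acc)
            total += a;                 // final combine
        return total;
    }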

You also need to consider what form the final output will take. If the output is low precision then the best you can hope for is that each input makes an equal contribution to the output -- you need enough precision in your accumulator to ensure this.

For some uses you could also consider dithering when quantising down to the final output precision.
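
A sketch of TPDF dither, assuming you know the size of one output quantisation step (the names here are made up):

    #include <cmath>
    #include <cstdlib>

    // TPDF dither before quantisation: add triangular noise of +/- 1 LSB,
    // then round to the nearest output step, so the quantisation error is
    // decorrelated from the signal instead of being correlated distortion.
    // 'lsb' is the size of one output step (set it from your format).
    double ditherAndQuantise(double x, double lsb)
    {
        double r1 = std::rand() / (double)RAND_MAX;      // uniform in [0,1]
        double r2 = std::rand() / (double)RAND_MAX;
        double tpdf = (r1 - r2) * lsb;                   // triangular in [-lsb, +lsb]
        return lsb * std::floor((x + tpdf) / lsb + 0.5); // round to nearest step
    }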

Ross.
