Hi Alessandro,
A lot has been written about this. Google "precision of summing floating
point values" and read the .pdfs on the first page for some analysis.
Follow the citations for more.
Somewhere there is a paper that analyses the performance of different
methods and suggests the optimal approach. I think it is signal dependent.
On 10/12/2012 11:32 AM, Alessandro Saccoia wrote:
It's going to be a sort of cumulative process that goes on in time,
so I won't necessarily know N in advance. If I had strong evidence
that I should prefer one method over the other, I could decide to
keep all the temporary X1, X2,… and recompute everything each time.
Performance and storage are not a strong concern in this case, but
the quality of the final and intermediate results is.
Dividing by N only when you need to compute the mean sounds like a
good idea; that way you won't be discarding the precision of each value
prior to summing it.
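A minimal sketch of that idea (the class name is just for illustration): keep a running sum and a count, and divide only at readout.

```python
class RunningMean:
    """Accumulate a running sum; divide by N only when the mean is read."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, x):
        # No per-sample division here -- each value goes into the
        # accumulator at full precision.
        self.total += x
        self.count += 1

    def mean(self):
        # The single division happens once, at readout.
        return self.total / self.count
```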
To avoid throwing out any information you could use high-precision
arithmetic. Did you say whether the signals are originally integers or
floating point? If they're integers, can you keep the sum in a 64 bit
int? Otherwise maybe a double. You can easily compute when you will
start to lose precision in floating point based on the range of the
input and the number of elements in the sum (summing 2 values requires 1
extra bit of headroom, 4 values requires 2 bits, 8 values 3 bits etc).
So for 1024 32 bit values you'll need 10 more bits to avoid any loss of
precision due to truncation... etc. There is also arbitrary precision
arithmetic if you don't want to throw any bits away. There is something
called "double double" which is a software 128 bit floating point type
that maybe isn't too expensive.
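The headroom calculation above is just ceil(log2(N)) extra bits; a small sketch (the function name is mine, not from the thread):

```python
import math

def headroom_bits(n_values):
    """Extra bits of headroom needed to sum n values without overflow:
    summing 2 values needs 1 extra bit, 4 values 2 bits, 8 values 3 bits."""
    return math.ceil(math.log2(n_values))

# Summing 1024 32-bit integer values needs 10 extra bits (42 total),
# so a 64-bit integer accumulator has plenty of room to spare.
```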
One problem with floating point is that adding a large value and a small
value will truncate the small value (first it needs to be shifted so it
has the same exponent as the large value). You didn't say much about
your values, but assuming that you're adding values distributed around a
non-zero mean, the accumulator will increase in value as you add more
values. Thus later summands will be truncated more than earlier ones.
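A quick demonstration of that truncation, using a deliberately extreme accumulator value (2**53, where the spacing between adjacent doubles becomes 2.0):

```python
# At 2**53 the gap between adjacent doubles is 2.0, so adding 1.0
# to the accumulator is rounded away entirely.
acc = 2.0 ** 53
print(acc + 1.0 == acc)          # True: the small addend is lost

# Adding 1000 small values one at a time loses every one of them,
# while summing the small values first preserves their contribution:
naive = 2.0 ** 53
for _ in range(1000):
    naive += 1.0                 # truncated on every single addition
grouped = 2.0 ** 53 + 1000.0     # small values summed first, added once

print(naive)    # 9007199254740992.0  (unchanged)
print(grouped)  # 9007199254741992.0
```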
One way to minimise this is to maintain multiple accumulators and only
sum a certain number of values into each one (or sum into successive
accumulators kind of like a ring buffer of accumulators). Then sum the
accumulators together at the end. This reduces truncation effects since
each accumulator has a limited range (hence higher precision of
summation), and when you sum the final accumulators together they will
(hopefully) all have a similar range.
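A sketch of the multiple-accumulator scheme, under the assumption that a fixed group size is acceptable (the function name and default are mine):

```python
def partitioned_sum(values, group_size=1024):
    """Sum at most `group_size` values into each accumulator, then sum
    the accumulators. Each accumulator stays in a limited range, so
    later summands are truncated less than with one running total."""
    accumulators = []
    acc = 0.0
    for i, x in enumerate(values):
        acc += x
        if (i + 1) % group_size == 0:
            # This accumulator is full; start a fresh one.
            accumulators.append(acc)
            acc = 0.0
    accumulators.append(acc)  # remainder (possibly 0.0)
    # The final pass sums values of (hopefully) similar magnitude.
    return sum(accumulators)
```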
A variation on this, if you know your signals have different magnitudes
(e.g. you are summing both X and X^2), is to use separate accumulators for
each magnitude class - since these are obviously going to have vastly
different domains.
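For the X and X^2 example that might look like the following minimal sketch (function name is mine):

```python
def sum_moments(values):
    """Accumulate X and X^2 in separate accumulators, since the two
    have vastly different magnitude ranges and shouldn't be mixed."""
    s1 = 0.0  # accumulates x
    s2 = 0.0  # accumulates x*x
    for x in values:
        s1 += x
        s2 += x * x
    return s1, s2
```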
You also need to consider what form the final output will take. If the
output is low precision then the best you can hope for is that each
input makes an equal contribution to the output -- you need enough
precision in your accumulator to ensure this.
For some uses you could also consider dithering the output to improve the result.
Ross.
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp