"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> wrote in message news:5gh6u8-3062....@ntp6.tmsw.no...
> sfuerst wrote:
>> There is a straight-forward algorithm using the fact that only one of
>> the bounds can be crossed...

>> Something like this:
>> (Inputs in %xmm0, and %xmm1, output in %xmm0)

>> subsd %xmm1,%xmm0
>> movsd plusM_PI(%rip), %xmm1
>> movsd minusM_PI(%rip), %xmm2
>> cmpgtsd %xmm0, %xmm1
>> cmpltsd %xmm0, %xmm2
>> andpd minus2M_PI(%rip), %xmm1
>> andpd plus2M_PI(%rip), %xmm2
>> addsd %xmm1, %xmm0
>> addsd %xmm2, %xmm0

>> I probably have some of the comparisons reversed by mistake... but you
>> get the idea. You can do both comparisons in parallel. Using sign
>> tricks doesn't seem to be profitable, as that increases the length of
>> the critical path.

> Very nice, and definitely much better than my approach!
> :-)

I really liked your approach more because it doesn't involve as
many loads nor as many long-latency operations like ADDSD and
CMPccSD. Looking at the above code we see four such long-latency
instructions in the path and I think we can do better with:

   subsd   xmm0, xmm1          ; {clock 1}
   movsd   xmm2, [signbits]    ; -0.0 {asynchronous}
   movaps  xmm3, xmm2
   andps   xmm2, xmm0          ; sign(0.0,delta) {clock 4}
   andnps  xmm3, xmm0          ; abs(delta) {clock 4}
   xorps   xmm2, [minustwopi]  ; -sign(2*pi,delta) {clock 5}
   cmplesd xmm3, [pi]          ; -1 or 0 {clock 5}
   addsd   xmm2, xmm0          ; delta-sign(2*pi,delta) {clock 6}
   andps   xmm0, xmm3          ; delta or 0 {clock 8}
   andnps  xmm3, xmm2          ; 0 or delta-sign(2*pi,delta) {clock 9}
   orps    xmm0, xmm3          ; delta or delta-sign(2*pi,delta) {clock 10}

-- 
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
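
For anyone who wants to check the mask logic of the second sequence without an assembler, here is a rough portable C sketch of the same steps. It is only an illustration: the names wrap_delta, bits_of and double_of are made up for this example, and it assumes both inputs already lie in [-pi, pi], so that only one bound can be crossed, as stated in the quoted post. A compiler will not necessarily turn this into the instruction sequence above.

```c
#include <stdint.h>
#include <string.h>

static const double PI     = 3.14159265358979323846;
static const double TWO_PI = 6.28318530717958647692;

/* Reinterpret a double as its 64-bit pattern and back (hypothetical helpers). */
static uint64_t bits_of(double x)     { uint64_t u; memcpy(&u, &x, sizeof u); return u; }
static double   double_of(uint64_t u) { double x;   memcpy(&x, &u, sizeof x); return x; }

/* Hypothetical name: returns a - b wrapped back into [-pi, pi].
   Assumes a and b are each in [-pi, pi], so |a - b| <= 2*pi. */
double wrap_delta(double a, double b)
{
    const uint64_t SIGN = UINT64_C(0x8000000000000000);   /* bit pattern of -0.0 */

    double   delta   = a - b;                              /* subsd                      */
    uint64_t d       = bits_of(delta);
    uint64_t sign    = d & SIGN;                           /* andps:  sign(0.0,delta)    */
    double   mag     = double_of(d & ~SIGN);               /* andnps: abs(delta)         */
    double   corr    = double_of(sign ^ bits_of(-TWO_PI)); /* xorps:  -sign(2*pi,delta)  */
    uint64_t keep    = (mag <= PI) ? UINT64_MAX : 0;       /* cmplesd: all ones or zero  */
    double   wrapped = delta + corr;                       /* addsd                      */

    /* andps / andnps / orps: select delta when in range, else the wrapped value. */
    return double_of((d & keep) | (bits_of(wrapped) & ~keep));
}
```

For example, wrap_delta(3.0, -3.0) starts from a raw difference of 6.0, which is above pi, so the result is 6.0 - 2*pi; wrap_delta(1.0, 0.5) is already in range and comes back unchanged as 0.5.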