"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> wrote in message
news:[email protected]...
> sfuerst wrote:
>> There is a straight-forward algorithm using the fact that only one of
>> the bounds can be crossed...
>> Something like this:
>> (Inputs in %xmm0, and %xmm1, output in %xmm0)
>> subsd %xmm1,%xmm0
>> movsd plusM_PI(%rip), %xmm1
>> movsd minusM_PI(%rip), %xmm2
>> cmpgtsd %xmm0, %xmm1
>> cmpltsd %xmm0, %xmm2
>> andpd minus2M_PI(%rip), %xmm1
>> andpd plus2M_PI(%rip), %xmm2
>> addsd %xmm1, %xmm0
>> addsd %xmm2, %xmm0
>> I probably have some of the comparisons reversed by mistake... but you
>> get the idea. You can do both comparisons in parallel. Using sign
>> tricks doesn't seem to be profitable, as that increases the length of
>> the critical path.
> Very nice, and definitely much better than my approach!
> :-)
I actually liked your approach more, because it involves fewer loads
and fewer long-latency operations such as ADDSD and CMPccSD. The code
above has four such long-latency instructions on its critical path
(SUBSD, a compare, and two ADDSDs), and I think we can do better with:
; Inputs in xmm0 and xmm1, output in xmm0 (Intel operand order)
subsd   xmm0, xmm1          ; delta = xmm0 - xmm1 {clock 1}
movsd   xmm2, [signbits]    ; -0.0 {asynchronous}
movaps  xmm3, xmm2          ; second copy of the sign mask
andps   xmm2, xmm0          ; sign(0.0,delta) {clock 4}
andnps  xmm3, xmm0          ; abs(delta) {clock 4}
xorps   xmm2, [minustwopi]  ; -sign(2*pi,delta) {clock 5}
cmplesd xmm3, [pi]          ; -1 or 0 {clock 5}
addsd   xmm2, xmm0          ; delta-sign(2*pi,delta) {clock 6}
andps   xmm0, xmm3          ; delta or 0 {clock 8}
andnps  xmm3, xmm2          ; 0 or delta-sign(2*pi,delta) {clock 9}
orps    xmm0, xmm3          ; delta or delta-sign(2*pi,delta) {clock 10}
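
For anyone who would rather read this in C, here is a rough sketch of
the same sequence using SSE2 intrinsics, next to a plain scalar version
to compare against. The names wrap_delta_ref, wrap_delta_sse2, WRAP_PI
and WRAP_TWO_PI are made up for illustration and do not come from the
code above. Both functions assume an x86 target with SSE2 and assume the
two inputs already lie in [-pi, pi], so that the difference is at most
2*pi in magnitude. A compiler is free to schedule the intrinsics
differently from the hand-written order, so the clock counts in the
comments above do not carry over.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics; assumes an x86 target */

#define WRAP_PI     3.14159265358979323846
#define WRAP_TWO_PI (2.0 * WRAP_PI)

/* Scalar reference: wrap a - b back into [-pi, pi].
   Assumption: a and b are each already in [-pi, pi]. */
static double wrap_delta_ref(double a, double b)
{
    double delta = a - b;
    assert(delta >= -WRAP_TWO_PI && delta <= WRAP_TWO_PI);
    if (delta > WRAP_PI)
        return delta - WRAP_TWO_PI;
    if (delta < -WRAP_PI)
        return delta + WRAP_TWO_PI;
    return delta;
}

/* Branchless version; each line names the instruction it stands for. */
static double wrap_delta_sse2(double a, double b)
{
    const __m128d signbits   = _mm_set_sd(-0.0);
    const __m128d minustwopi = _mm_set_sd(-WRAP_TWO_PI);
    const __m128d pi         = _mm_set_sd(WRAP_PI);

    __m128d delta   = _mm_sub_sd(_mm_set_sd(a), _mm_set_sd(b)); /* subsd   */
    __m128d sign    = _mm_and_pd(signbits, delta);     /* andps: sign bit  */
    __m128d mag     = _mm_andnot_pd(signbits, delta);  /* andnps: abs      */
    __m128d corr    = _mm_xor_pd(sign, minustwopi);    /* xorps: -/+ 2*pi  */
    __m128d keep    = _mm_cmple_sd(mag, pi);           /* cmplesd: mask    */
    __m128d wrapped = _mm_add_sd(corr, delta);         /* addsd            */

    /* Select delta where the mask is set, the wrapped value elsewhere. */
    __m128d result = _mm_or_pd(_mm_and_pd(keep, delta),
                               _mm_andnot_pd(keep, wrapped));
    return _mm_cvtsd_f64(result);
}
```

The select at the end is the andps/andnps/orps triple: the compare
yields an all-ones or all-zeros mask, so exactly one of the two
candidates survives the OR.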
--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
_______________________________________________
help-gplusplus mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/help-gplusplus