On 29 Nov 00, at 20:31, Jason Stratos Papadopoulos wrote:
> > Humm, too much slowdown for single Ia32 instructions, Intel engineers
> > will know the reasons.
Presumably something to do with the architecture not being parallel:
there's only one set of flags, so you can't start an ADC until the
previous instruction has completed and the flags register has been
retired. In IA32, there's nothing to tell you _which_ carry flag to
use in an instruction that takes carry as an input, or output.
>
> This and the 14-18 clock multiply are extremely depressing. For all the
> marketing hype about Intel catering to the internet and the next
> generation of "active media", they're making it awfully difficult to
> implement the cryptography that the internet needs.
Well, you're not _forced_ to buy Intel - even if you stick to
Windoze! I get the impression that, with the Athlon architecture, AMD
are making serious inroads into the market - especially at the lower
end. Now it just so happens that the Athlon architecture runs Prime95
rather well.
>
> The alpha was already at least 5x faster than a PIII for multiprecision
> arithmetic at the same clock speed; with the P4 it will only get worse.
Are you sure about this? I think, with Alpha, you have to execute two
multiply instructions to get a double-precision product - one
instruction can store either the low half or the high half of the
product, but not both.
Of course, the Alpha's single-precision integer arithmetic is 64 bits
wide, not 32. This does help somewhat :)
It's also easy to get confused between the latency (the time from
issuing an instruction to its result becoming available) and the
throughput (the inverse of the longest time such an instruction ties
up any one execution unit, multiplied by the number of copies of that
critical unit in the design).
The fact remains that the P4, like all other commercial processors,
is a compromise. This is what makes it so darned complicated, and
also indicates why the designers have to make performance tradeoffs,
some of which don't suit our particular application. It would be
possible to optimize the design so that it executed the instructions
_we_ find useful much more quickly, but whether such a processor
would be capable of running standard commercial benchmarking programs
at a reasonable speed - or indeed at all - is open to question ...
we'd probably want to reduce the inherent complexity by junking the
16-bit x86 legacy, for a start...
Regards
Brian Beesley
_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers