On Wednesday, 4 March 2026 03:27:40 Pacific Standard Time Trevor Gross wrote:
> This was brought up before in the thread at [1], with the concern about
> efficient 16-bit moves between GPRs or memory and XMM. This doesn't seem
> to be relevant, however, given there isn't any reason to have a _Float16
> in XMM unless F16C is available, implying SSE2 and SSE4.1 for PINSRW and
> PEXTRW to/from memory (unless I am missing something?).
There is still a cost to transferring between one register file and the other:
those operations take 3 cycles. That would imply efficient software using F16C
or (better yet) AVX512FP16 would pay an extra 3-cycle penalty to move the value
into a GPR on function return and another 3 cycles to reload it back into the
SSE register file.
This is of course the opposite of what would happen on systems requiring
emulation of FP16 conversions: one would pay a 3-cycle penalty to move from GPR
to SSE on function return and another 3 cycles to move it back to make any use
of the returned number.
So there are two questions to be answered, one of which already has been:
1) does FP16 support require SSE?
H.J. stated it does in the discussion you linked to and no one argued.
2) whom are we optimising this for: emulated conversions or HW-backed ones?
F16C was first introduced in 2013, though there are still systems without AVX
being produced (e.g. embedded Pentium and Celeron). But they already have a
massive performance loss by having to convert to and from FP32 in software,
before performing even simple math like:
_Float16 f(_Float16 a, _Float16 b)
{
    return a + b;
}
So I'd argue it's not worth optimising for them, and it's far better to allow
the best performance when one has HW-backed conversion instructions (and for
GCC, using -mfpmath=sse).
Are you asking to reopen the "requires SSE" discussion?
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Principal Engineer - Intel Data Center - Platform & Sys. Eng.
