On Wed, Mar 4, 2026 at 2:09 PM Thiago Macieira <[email protected]> wrote:
>
> On Wednesday, 4 March 2026 03:27:40 Pacific Standard Time Trevor Gross wrote:
> > This was brought up before in the thread at [1], with the concern about
> > efficient 16-bit moves between GPRs or memory and XMM. This doesn't seem
> > to be relevant, however, given there isn't any reason to have a _Float16
> > in XMM unless F16C is available, implying SSE2 and SSE4.1 for PINSRW and
> > PEXTRW to/from memory (unless I am missing something?).
>
> There is still a cost of transferring from one register file to another: those
> operations cost 3 cycles. That would imply efficient software that uses F16C or
> (better yet) AVX512FP16 would pay an extra 3-cycle penalty to move into a GPR
> on function return and another 3 cycles to reload it back into the SSE
> register file.
>
> This is of course the opposite of what would happen on systems requiring
> emulation of FP16 conversions: one would pay a 3-cycle penalty to move from GPR
> to SSE on function return and another 3 cycles to move it back to make any use
> of the returned number.

It indeed is not maximally efficient, but any `float` or `double` code
is already paying a similar (or slightly higher) cost for the %st0
return, right? At least when operations are done in XMM registers,
which Clang does whenever SSE2 is available (and GCC does with the
right options).

The compatibility issues from using XMM don't seem worth the cycle
savings specifically for _Float16, given that the other float types
already pay this cost at non-inlineable function boundaries. Especially
since many operations on the type require an f16<->f32 conversion
anyway, which itself has no call overhead (when hardware-supported).
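To make that conversion cost concrete, here is a minimal sketch of what
an emulated _Float16 addition looks like without F16C: every operation
widens to float, operates, and narrows back. (The helper names here are
hypothetical and the code handles only normal, finite values; the real
runtime routines in libgcc/compiler-rt also cover zeros, subnormals,
infinities, NaNs, and correct rounding.)

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers sketching the software fallback path.
   Normal, finite values only. */

static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;     /* biased by 15 */
    uint32_t mant = h & 0x3FF;
    /* Rebias exponent 15 -> 127, widen mantissa 10 -> 23 bits. */
    uint32_t bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

static uint16_t float_to_half(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t sign = (bits >> 31) << 15;
    uint32_t exp  = (bits >> 23) & 0xFF;  /* biased by 127 */
    uint32_t mant = (bits >> 13) & 0x3FF; /* truncate; no rounding */
    return (uint16_t)(sign | ((exp - 127 + 15) << 10) | mant);
}

/* Emulated _Float16 addition: widen, add in float, narrow. */
static uint16_t f16_add(uint16_t a, uint16_t b)
{
    return float_to_half(half_to_float(a) + half_to_float(b));
}
```

With F16C, both conversions collapse into single VCVTPH2PS/VCVTPS2PH
instructions, which is the case an XMM return is meant to serve.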

> So there are two questions to be answered, one of which has already been:
>
> 1) does FP16 support require SSE?
>
> H.J. stated it does in the discussion you linked to and no one argued.

I took Joseph's first reply on that thread to express some
disagreement, followed by discussion about efficient GPR<->XMM moves to
support a GPR return, which didn't come to a firm conclusion. But it is
possible I am misreading; none of this is stated explicitly.

(Joseph's email address from that thread bounced, added a new one here.)

> 2) whom are we optimising this for: emulated conversions or HW-backed ones?
>
> F16C was first introduced in 2013, though there are still systems without AVX
> being produced (e.g. embedded Pentium and Celeron). But they already have a
> massive performance loss by having to convert to and from FP32 in software,
> before performing even simple math like:
>
> _Float16 f(_Float16 a, _Float16 b)
> {
>     return a + b;
> }
>

At the ABI level the choice isn't between two performance-optimization
goals, but between optimization and compatibility. The current
_Float16 ABI does lean toward optimization (as much as possible given
stack passing), but that makes it the only C-specified type that is not
compatible with baseline i386.

> So I'd argue it's not worth optimising for them, and it's far better to allow
> the best performance when one has HW-backed conversion instructions (and for
> GCC, using -mfpmath=sse).

This is a bit of a tangent, but I think it would be much more useful to
have an ABI-changing flag that raises the baseline to SSE2 and returns
_Float16, float, and double in XMM. That gets the return-ABI
performance improvement for all float types, not just _Float16, and
would effectively resolve a whole class of issues for x86-32 users
([1], [2], [3]).

> Are you asking to reopen the "requires SSE" discussion?

That is my interest here, to the extent that is possible at this point.

Thanks,
Trevor

> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
>   Principal Engineer - Intel Data Center - Platform & Sys. Eng.

[1]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93681
[2]: https://github.com/llvm/llvm-project/issues/44218
[3]: https://github.com/llvm/llvm-project/issues/66803
