[PATCH] D150913: [Clang][BFloat16] Upgrade __bf16 to arithmetic type, change mangling, and extend excess precision support.

Phoebe Wang via Phabricator via cfe-commits Thu, 25 May 2023 19:16:31 -0700

pengfei added inline comments.


================
Comment at: clang/docs/LanguageExtensions.rst:852
 ``double`` when passed to ``printf``, so the programmer must explicitly cast it to
 ``double`` before using it with an ``%f`` or similar specifier.
 
----------------
rjmccall wrote:
> codemzs wrote:
> > pengfei wrote:
> > > rjmccall wrote:
> > > > pengfei wrote:
> > > > > rjmccall wrote:
> > > > > > Suggested rework:
> > > > > > 
> > > > > > ```
> > > > > > Clang supports three half-precision (16-bit) floating point types: 
> > > > > > ``__fp16``,
> > > > > > ``_Float16`` and ``__bf16``.  These types are supported in all 
> > > > > > language
> > > > > > modes, but not on all targets:
> > > > > > 
> > > > > > - ``__fp16`` is supported on every target.
> > > > > > 
> > > > > > - ``_Float16`` is currently supported on the following targets:
> > > > > >   * 32-bit ARM (natively on some architecture versions)
> > > > > >   * 64-bit ARM (AArch64) (natively on ARMv8.2a and above)
> > > > > >   * AMDGPU (natively)
> > > > > >   * SPIR (natively)
> > > > > >   * X86 (if SSE2 is available; natively if AVX512-FP16 is also 
> > > > > > available)
> > > > > > 
> > > > > > - ``__bf16`` is currently supported on the following targets:
> > > > > >   * 32-bit ARM
> > > > > >   * 64-bit ARM (AArch64)
> > > > > >   * X86 (when SSE2 is available)
> > > > > > 
> > > > > > (For X86, SSE2 is available on 64-bit and all recent 32-bit 
> > > > > > processors.)
> > > > > > 
> > > > > > ``__fp16`` and ``_Float16`` both use the binary16 format from IEEE
> > > > > > 754-2008, which provides a 5-bit exponent and an 11-bit significand
> > > > > > (counting the implicit leading 1).  ``__bf16`` uses the `bfloat16
> > > > > > <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ 
> > > > > > format,
> > > > > > which provides an 8-bit exponent and an 8-bit significand; this is 
> > > > > > the same
> > > > > > exponent range as `float`, just with greatly reduced precision.
> > > > > > 
> > > > > > ``_Float16`` and ``__bf16`` follow the usual rules for arithmetic
> > > > > > floating-point types.  Most importantly, this means that arithmetic 
> > > > > > operations
> > > > > > on operands of these types are formally performed in the type and 
> > > > > > produce
> > > > > > values of the type.  ``__fp16`` does not follow those rules: most 
> > > > > > operations
> > > > > > immediately promote operands of type ``__fp16`` to ``float``, and so
> > > > > > arithmetic operations are defined to be performed in ``float`` and 
> > > > > > so result in
> > > > > > a value of type ``float`` (unless further promoted because of other 
> > > > > > operands).
> > > > > > See below for more information on the exact specifications of these 
> > > > > > types.
> > > > > > 
> > > > > > Only some of the supported processors for ``__fp16`` and ``__bf16`` 
> > > > > > offer
> > > > > > native hardware support for arithmetic in their corresponding 
> > > > > > formats.
> > > > > > The exact conditions are described in the lists above.  When 
> > > > > > compiling for a
> > > > > > processor without native support, Clang will perform the arithmetic 
> > > > > > in
> > > > > > ``float``, inserting extensions and truncations as necessary.  This 
> > > > > > can be
> > > > > > done in a way that exactly emulates the behavior of hardware 
> > > > > > support for
> > > > > > arithmetic, but it can require many extra operations.  By default, 
> > > > > > Clang takes
> > > > > > advantage of the C standard's allowances for excess precision in 
> > > > > > intermediate
> > > > > > operands in order to eliminate intermediate truncations within 
> > > > > > statements.
> > > > > > This is generally much faster but can generate different results 
> > > > > > from strict
> > > > > > operation-by-operation emulation.
> > > > > > 
> > > > > > The use of excess precision can be independently controlled for 
> > > > > > these two
> > > > > > types with the ``-ffloat16-excess-precision=`` and
> > > > > > ``-fbfloat16-excess-precision=`` options.  Valid values include:
> > > > > > - ``none`` (meaning to perform strict operation-by-operation 
> > > > > > emulation)
> > > > > > - ``standard`` (meaning that excess precision is permitted under 
> > > > > > the rules
> > > > > >   described in the standard, i.e. never across explicit casts or 
> > > > > > statements)
> > > > > > - ``fast`` (meaning that excess precision is permitted whenever the
> > > > > >   optimizer sees an opportunity to avoid truncations; currently 
> > > > > > this has no
> > > > > >   effect beyond ``standard``)
> > > > > > 
> > > > > > The ``_Float16`` type is an interchange floating type specified in
> > > > > >  ISO/IEC TS 18661-3:2015 ("Floating-point extensions for C").  It 
> > > > > > will
> > > > > > be supported on more targets as they define ABIs for it.
> > > > > > 
> > > > > > The ``__bf16`` type is a non-standard extension, but it generally 
> > > > > > follows
> > > > > > the rules for arithmetic interchange floating types from ISO/IEC TS
> > > > > > 18661-3:2015.  In previous versions of Clang, it was a storage-only 
> > > > > > type
> > > > > > that forbade arithmetic operations.  It will be supported on more 
> > > > > > targets
> > > > > > as they define ABIs for it.
> > > > > > 
> > > > > > The ``__fp16`` type was originally an ARM extension and is specified
> > > > > > by the `ARM C Language Extensions 
> > > > > > <https://github.com/ARM-software/acle/releases>`_.
> > > > > > Clang uses the ``binary16`` format from IEEE 754-2008 for 
> > > > > > ``__fp16``,
> > > > > > not the ARM alternative format.  Operators that expect arithmetic 
> > > > > > operands
> > > > > > immediately promote ``__fp16`` operands to ``float``.
> > > > > > 
> > > > > > It is recommended that portable code use ``_Float16`` instead of 
> > > > > > ``__fp16``,
> > > > > > as it has been defined by the C standards committee and has 
> > > > > > behavior that is
> > > > > > more familiar to most programmers.
> > > > > > 
> > > > > > Because ``__fp16`` operands are always immediately promoted to 
> > > > > > ``float``, the
> > > > > > common real type of ``__fp16`` and ``_Float16`` for the purposes of 
> > > > > > the usual
> > > > > > arithmetic conversions is ``float``.
> > > > > > 
> > > > > > A literal can be given ``_Float16`` type using the suffix ``f16``. 
> > > > > > For example,
> > > > > > ``3.14f16``.
> > > > > > 
> > > > > > Because default argument promotion only applies to the standard 
> > > > > > floating-point
> > > > > > types, ``_Float16`` values are not promoted to ``double`` when 
> > > > > > passed as variadic
> > > > > > or untyped arguments.  As a consequence, some caution must be taken 
> > > > > > when using
> > > > > > certain library facilities with ``_Float16``; for example, there is 
> > > > > > no ``printf`` format
> > > > > > specifier for ``_Float16``, and (unlike ``float``) it will not be 
> > > > > > implicitly promoted to
> > > > > > ``double`` when passed to ``printf``, so the programmer must 
> > > > > > explicitly cast it to
> > > > > > ``double`` before using it with an ``%f`` or similar specifier.
> > > > > > ```
> > > > > ```
> > > > > Only some of the supported processors for ``__fp16`` and ``__bf16`` 
> > > > > offer
> > > > > native hardware support for arithmetic in their corresponding formats.
> > > > > ```
> > > > > 
> > > > > Do you mean ``_Float16``?
> > > > > 
> > > > > ```
> > > > > The exact conditions are described in the lists above.  When 
> > > > > compiling for a
> > > > > processor without native support, Clang will perform the arithmetic in
> > > > > ``float``, inserting extensions and truncations as necessary.
> > > > > ```
> > > > > 
> > > > > This conflicts a bit with `These types are supported in all language 
> > > > > modes, but not on all targets`.
> > > > > Why do we need to emulate a type that isn't necessarily supported 
> > > > > on all targets?
> > > > > 
> > > > > My understanding is that inserting extensions and truncations serves 
> > > > > two purposes:
> > > > > 1. Supporting a type that is designed to work on all targets. For now, 
> > > > > that is only __fp16.
> > > > > 2. Supporting excess-precision=`standard`. This applies to both 
> > > > > _Float16 and __bf16.
> > > > > 
> > > > > Do you mean `_Float16`?
> > > > 
> > > > Yes, thank you.  I knew I'd screw that up somewhere.
> > > > 
> > > > > Why do we need to emulate a type that isn't necessarily supported 
> > > > > on all targets?
> > > > 
> > > > Would this be clearer?
> > > > 
> > > > ```
> > > > Arithmetic on ``_Float16`` and ``__bf16`` is enabled on some targets 
> > > > that don't
> > > > provide native architectural support for arithmetic on these formats.  
> > > > These
> > > > targets are noted in the lists of supported targets above.  On these 
> > > > targets,
> > > > Clang will perform the arithmetic in ``float``, inserting extensions 
> > > > and truncations
> > > > as necessary.
> > > > ```
> > > > 
> > > > > My understanding is that inserting extensions and truncations serves 
> > > > > two purposes:
> > > > 
> > > > No, I believe we always insert extensions and truncations.  The cases 
> > > > you're describing are places we insert extensions and truncations in 
> > > > the *frontend*, so that the backend doesn't see operations on `half` / 
> > > > `bfloat` at all.  But when these operations do make it to the backend, 
> > > > and there's no direct architectural support for them on the target, the 
> > > > backend still just inserts extensions and truncations so it can do the 
> > > > arithmetic in `float`.  This is clearest in the ARM codegen 
> > > > (https://godbolt.org/z/q9KoGEYqb) because the conversions are just 
> > > > instructions, but you can also see it in the X86 codegen 
> > > > (https://godbolt.org/z/ejdd4P65W): all the runtime functions are just 
> > > > extensions/truncations, and the actual arithmetic is done with `mulss` 
> > > > and `addss`.  This frontend/backend distinction is not something that 
> > > > matters to users, so the documentation glosses over the difference.
> > > > 
> > > > I haven't done an exhaustive investigation, so it's possible that there 
> > > > are types and targets where we emit a compiler-rt call to do each 
> > > > operation instead, but those compiler-rt functions almost certainly 
> > > > just do an extension to float in the same way, so I don't think the 
> > > > documentation as written would be misleading for those targets, either.
> > > Thanks for the explanation! Sorry, I failed to make the distinction 
> > > between "support" and "native support". I guess users may be confused 
> > > at first too.
> > > 
> > > I agree the documentation should explain the overall behavior of the 
> > > compiler to users. I think there are three aspects we want to tell them:
> > > 
> > > 1. Whether a type is an arithmetic type, and whether it is (natively) 
> > > supported by all targets or just a few;
> > > 2. Results for a type may not be consistent across different targets 
> > > and/or excess-precision values;
> > > 3. The excess-precision control doesn't take effect if a type is natively 
> > > supported by the target.
> > > 
> > > It would be clearer if we gave such a summary before the detailed 
> > > explanation.
> > Does adding the text below to the top of the description make it clearer?
> > 
> > Half-Precision Floating Point
> > =============================
> > 
> > Clang supports three half-precision (16-bit) floating point types: 
> > ``__fp16``, ``_Float16`` and ``__bf16``. These types are supported in all 
> > language modes, but their support differs across targets. Here, it's 
> > important to understand the difference between "support" and "natively 
> > support":
> > 
> > - A type is "supported" if the compiler can handle code using that type, 
> > which might involve translating operations into equivalent code that the 
> > target hardware understands.
> > - A type is "natively supported" if the hardware itself understands the 
> > type and can perform operations on it directly. This typically yields 
> > better performance and more accurate results.
> > 
> > Another crucial aspect to note is the consistency of the result of a type 
> > across different targets and excess-precision values. Different hardware 
> > (targets) might produce slightly different results due to the level of 
> > precision they support and how they handle excess-precision values. It 
> > means the same code can yield different results when compiled for different 
> > hardware.
> > 
> > Finally, note that the control of excess-precision does not take effect if 
> > a type is natively supported by targets. If the hardware supports the type 
> > directly, the compiler does not need to (and cannot) use excess precision 
> > to potentially speed up the operations.
> > 
> > Given these points, here is the detailed support for each type:
> > 
> > - ``__fp16`` is supported on every target.
> > 
> > - ``_Float16`` is currently supported on the following targets:
> >   * 32-bit ARM (natively on some architecture versions)
> >   * 64-bit ARM (AArch64) (natively on ARMv8.2a and above)
> >   * AMDGPU (natively)
> >   * SPIR (natively)
> >   * X86 (if SSE2 is available; natively if AVX512-FP16 is also available)
> > 
> > - ``__bf16`` is currently supported on the following targets:
> >   * 32-bit ARM
> >   * 64-bit ARM (AArch64)
> >   * X86 (when SSE2 is available)
> > 
> > ...
> > ...
> I think that's a good basic idea, but it's okay to leave some of the detail 
> for later.  How about this:
> 
> ```
> Clang supports three half-precision (16-bit) floating point types: 
> ``__fp16``, ``_Float16`` and ``__bf16``. These types are supported in all 
> language modes, but their support differs between targets.  A target is said 
> to have "native support" for a type if the target processor offers 
> instructions for directly performing basic arithmetic on that type.  In the 
> absence of native support, a type can still be supported if the compiler can 
> emulate arithmetic on the type by promoting to ``float``; see below for more 
> information on this emulation.
> 
> * ``__fp16`` is supported on all targets.  The special semantics of this type 
> mean that no arithmetic is ever performed directly on ``__fp16`` values; see 
> below.
> 
> * ``_Float16`` is supported on the following targets: (...)
> 
> * ``__bf16`` is supported on the following targets (currently never 
> natively): (...)
> ```
> 
> And then below we can adjust the paragraph about emulation:
> 
> ```
> When compiling arithmetic on ``_Float16`` and ``__bf16`` for a target without
> native support, Clang will perform the arithmetic in ``float``, inserting 
> extensions
> and truncations as necessary.  This can be done in a way that exactly matches 
> the
> operation-by-operation behavior of native support, but that can require many
> extra truncations and extensions.  By default, when emulating ``_Float16`` and
> ``__bf16`` arithmetic using ``float``, Clang does not truncate intermediate 
> operands
> back to their true type unless the operand is the result of an explicit cast 
> or
> assignment.  This is generally much faster but can generate different results 
> from
> strict operation-by-operation emulation.  (Usually the results are more 
> precise.)
> This is permitted by the C and C++ standards under the rules for excess 
> precision
> in intermediate operands; see the discussion of evaluation formats in the C
> standard and [expr.pre] in the C++ standard.
> ```
This revision looks better. The contents are rather clear to me. Thanks!
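
For readers following the thread, the difference between strict operation-by-operation emulation and the default excess-precision behavior can be sketched in portable C. This is only an illustration, not Clang's implementation. It assumes bfloat16 can be modelled as the upper 16 bits of an IEEE binary32 ``float`` with round-to-nearest-even truncation, it ignores NaN handling, and every name in it (``bf16_bits``, ``float_to_bf16``, ``bf16_to_float``, ``add3_strict``, ``add3_excess``, ``demo_bf16_excess_precision``) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a bfloat16 value stored as its raw 16-bit pattern. */
typedef uint16_t bf16_bits;

/* Truncate float -> bf16 with round-to-nearest-even (NaNs not handled). */
bf16_bits float_to_bf16(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    /* Add 0x7FFF plus the lowest kept bit, so exact ties round to even. */
    u += 0x7FFFu + ((u >> 16) & 1u);
    return (bf16_bits)(u >> 16);
}

/* Extend bf16 -> float; always exact, the low 16 bits are zero. */
float bf16_to_float(bf16_bits h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Strict emulation: truncate back to bf16 after every operation. */
bf16_bits add3_strict(bf16_bits a, bf16_bits b, bf16_bits c) {
    bf16_bits t = float_to_bf16(bf16_to_float(a) + bf16_to_float(b));
    return float_to_bf16(bf16_to_float(t) + bf16_to_float(c));
}

/* Excess precision: keep the intermediate in float, truncate once. */
bf16_bits add3_excess(bf16_bits a, bf16_bits b, bf16_bits c) {
    float t = bf16_to_float(a) + bf16_to_float(b);
    return float_to_bf16(t + bf16_to_float(c));
}

/* Worked example: 1 + 2^-8 + 2^-8, where 2^-8 is half a bf16 ulp at 1.0. */
void demo_bf16_excess_precision(void) {
    bf16_bits one  = float_to_bf16(1.0f);
    bf16_bits tiny = float_to_bf16(0.00390625f); /* 2^-8 */
    /* Each intermediate sum is an exact tie and rounds back down to 1. */
    assert(bf16_to_float(add3_strict(one, tiny, tiny)) == 1.0f);
    /* The float intermediate keeps both addends, giving 1 + 2^-7. */
    assert(bf16_to_float(add3_excess(one, tiny, tiny)) == 1.0078125f);
}
```

Calling ``demo_bf16_excess_precision()`` from ``main`` exercises both paths. With these inputs the strict version loses both small addends, while the single final truncation preserves their sum, which is the "usually more precise" case mentioned in the proposed wording.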


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150913/new/

https://reviews.llvm.org/D150913

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
