[PATCH] D155769: [Clang][docs][RFC] Add documentation for C++ Parallel Algorithm Offload

Alex Voicu via Phabricator via cfe-commits Wed, 19 Jul 2023 18:59:16 -0700

AlexVlx created this revision.
AlexVlx added reviewers: rjmccall, yaxunl, aaron.ballman.
AlexVlx added a project: clang.
Herald added a subscriber: arphaman.
Herald added a project: All.
AlexVlx requested review of this revision.
Herald added a subscriber: cfe-commits.


This patch adds the documentation for the standard algorithm offload feature 
being proposed here: 
https://discourse.llvm.org/t/rfc-adding-c-parallel-algorithm-offload-support-to-clang-llvm/72159/1.
 It is the parent of a series of patches that make up the implementation.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D155769

Files:
  clang/docs/StdParSupport.rst
  clang/docs/index.rst

Index: clang/docs/index.rst
===================================================================
--- clang/docs/index.rst
+++ clang/docs/index.rst
@@ -47,6 +47,7 @@
    OpenCLSupport
    OpenMPSupport
    SYCLSupport
+   StdParSupport
    HLSL/HLSLDocs
    ThinLTO
    APINotes
Index: clang/docs/StdParSupport.rst
===================================================================
--- /dev/null
+++ clang/docs/StdParSupport.rst
@@ -0,0 +1,381 @@
+==============================================================
+C++ Standard Parallelism Offload Support: Compiler And Runtime
+==============================================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+
+This document describes the implementation of support for offloading the
+execution of standard C++ algorithms to accelerators that can be targeted via
+HIP. Furthermore, it enumerates restrictions on user defined code, as well as
+the interactions with runtimes.
+
+Algorithm Offload: What, Why, Where
+===================================
+
+C++17 introduced overloads
+`for most algorithms in the standard library <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0024r2.html>`_
+which allow the user to specify a desired
+`execution policy <https://en.cppreference.com/w/cpp/algorithm#Execution_policies>`_.
+The `parallel_unsequenced_policy <https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t>`_
+maps relatively well to the execution model of many accelerators, such as GPUs.
+This, coupled with the ubiquity of GPU accelerated algorithm libraries that
+implement most / all corresponding libraries in the standard library
+(e.g. `rocThrust <https://github.com/ROCmSoftwarePlatform/rocThrust>`_), makes
+it feasible to provide seamless accelerator offload for supported algorithms,
+when an accelerated version exists. Thus, it becomes possible to easily access
+the computational resources of an accelerator, via a well specified, familiar,
+algorithmic interface, without having to delve into low-level hardware specific
+details. Putting it all together:
+
+- **What**: standard library algorithms, when invoked with the
+  ``parallel_unsequenced_policy``
+- **Why**: democratise accelerator programming, without loss of user familiarity
+- **Where**: any and all accelerators that can be targeted by Clang/LLVM via HIP
+
+Small Example
+=============
+
+Given the following C++ code, which assumes the ``std`` namespace is included:
+
+.. code-block:: C++
+
+   bool has_the_answer(const vector<int>& v) {
+     return find(execution::par_unseq, cbegin(v), cend(v), 42) != cend(v);
+   }
+
+if Clang is invoked with the ``-stdpar --offload-target=foo`` flags, the call to
+``find`` will be offloaded to an accelerator that is part of the ``foo`` target
+family. If either ``foo`` or its runtime environment do not support transparent
+on-demand paging (such as e.g. that provided in Linux via
+`HMM <https://docs.kernel.org/mm/hmm.html>`_), it is necessary to also include
+the ``--stdpar-interpose-alloc`` flag. If the accelerator specific algorithm
+library ``foo`` uses doesn't have an implementation of a particular algorithm,
+execution seamlessly falls back to the host CPU. It is legal to specify multiple
+``--offload-target``s. All the flags we introduce, as well as a thorough view of
+various restrictions and their implications will be provided below.
+
+Implementation - General View
+=============================
+
+We built support for Algorithm Offload support atop the pre-existing HIP
+infrastructure. More specifically, when one requests offload via ``-stdpar``,
+compilation is switched to HIP compilation, as if ``-x hip`` was specified.
+Similarly, linking is also switched to HIP linking, as if ``--hip-link`` was
+specified. Note that these are implicit, and one should not assume that any
+interop with HIP specific language constructs is available e.g. ``__device__``
+annotations are neither necessary nor guaranteed to work.
+
+Since there are no language restriction mechanisms in place, it is necessary to
+relax HIP language specific semantic checks performed by the FE; they would
+identify otherwise valid, offloadable code, as invalid HIP code. Given that we
+know that the user intended only for certain algorithms to be offloaded, and
+encoded this by specifying the ``parallel_unsequenced_policy``, we rely on a
+pass over IR to clean up any and all code that was not "meant" for offload. If
+requested, allocation interposition is also handled via a separate pass over IR.
+
+To interface with the client HIP runtime, and to forward offloaded algorithm
+invocations to the corresponding accelerator specific library implementation, an
+implementation detail forwarding header is implicitly included by the driver,
+when compiling with ``-stdpar``. In what follows, we will delve into each
+component that contributes to implementing Algorithm Offload support.
+
+Implementation - Driver
+=======================
+
+We augment the ``clang`` driver with the following flags:
+
+- ``-stdpar`` / ``--stdpar`` enables algorithm offload, which depending on
+  phase, has the following effects:
+
+  - when compiling:
+
+    - ``-x hip`` gets prepended to enable HIP support;
+    - the ``ROCmToolchain`` component checks for the ``stdpar_lib.hpp``
+      forwarding header,
+      `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_ and
+      `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ in
+      their canonical locations, which can be overriden via flags found below;
+      if all are found, the forwarding header gets implicitly included,
+      otherwise an error listing the missing component is generated;
+    - the ``LangOpts.HIPStdPar`` member is set.
+
+  - when linking:
+
+    - ``--hip-link`` and ``-frtlib-add-rpath`` gets appended to enable HIP
+      support.
+
+- ``-stdpar-interpose-alloc`` / ``--stdpar-interpose-alloc`` enables the
+  interposition of standard allocation / deallocation functions with accelerator
+  aware equivalents; the ``LangOpts.HIPStdParInterposeAlloc`` member is set;
+- ``--stdpar-path=`` specifies a non-canonical path for the forwarding header;
+  it must point to the folder where the header is located and not to the header
+  itself;
+- ``--stdpar-thrust-path=`` specifies a non-canonical path for
+  `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_; it
+  must point to the folder where the library is installed / built under a
+  ``/thrust`` subfolder;
+- ``--stdpar-prim-path=`` specifies a non-canonical path for
+  `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_; it must
+  point to the folder where the library is installed / built under a
+  ``/rocprim`` subfolder;
+
+The `--offload-arch <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
+flag can be used to specify the accelerator for which offload code is to be
+generated.
+
+Implementation - Front-End
+==========================
+
+When ``LangOpts.HIPStdPar`` is set, we relax some of the HIP language specific
+``Sema`` checks to account for the fact that we want to consume pure unannotated
+C++ code:
+
+1. ``__device__`` / ``__host__ __device__`` functions (which would originate in
+   the accelerator specific algorithm library) are allowed to call implicitly
+   ``__host__`` functions;
+2. ``__global__`` functions (which would originate in the accelerator specific
+   algorithm library) are allowed to call implicitly ``__host__`` functions;
+3. resolving ``__builtin`` availability is deferred, because it is possible that
+   a ``__builtin`` that is unavailable on the target accelerator is not
+   reachable from any offloaded algorithm, and thus will be safely removed in
+   the middle-end;
+4. ASM parsing / checking is deferred, because it is possible that an ASM block
+   that e.g. uses some constraints that are incompatible with the target
+   accelerator is not reachable from any offloaded algorithm, and thus will be
+   safely removed in the middle-end.
+
+``CodeGen`` is similarly relaxed, with implicitly ``__host__`` functions being
+emitted as well.
+
+Implementation - Middle-End
+===========================
+
+We add two ``opt`` passes:
+
+1. ``StdParAcceleratorCodeSelectionPass``
+
+   - For all kernels in a ``Module``, compute reachability, where a function
+     ``F`` is reachable from a kernel ``K`` if and only if there exists a direct
+     call-chain rooted in ``F`` that includes ``K``;
+   - Remove all functions that are not reachable from kernels;
+   - This pass is only run when compiling for the accelerator.
+
+The first pass assumes that the only code that the user intended to offload was
+that which was directly or transitively invocable as part of an algorithm
+execution. It also assumes that an accelerator aware algorithm implementation
+would rely on accelerator specific special functions (kernels), and that these
+effectively constitute the only roots for accelerator execution graphs. Both of
+these assumptions are based on observing how widespread accelerators,
+such as GPUs, work.
+
+1. ``StdParAllocationInterpositionPass``
+
+   - Iterate through all functions in a ``Module``, and replace standard
+     allocation / deallocation functions with accelerator-aware equivalents,
+     based on a pre-established table; the list of functions that can be
+     interposed is available
+     `here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#allocation--deallocation-interposition-status>`_;
+   - This is only run when compiling for the host.
+
+The second pass is optional.
+
+Implementation - Forwarding Header
+==================================
+
+The forwarding header implements two pieces of functionality:
+
+1. It forwards algorithms to a target accelerator, which is done by relying on
+   C++ language rules around overloading:
+
+   - overloads taking an explicit argument of type
+     ``parallel_unsequenced_policy`` are introduced into the ``std`` namespace;
+   - these will get preferentially selected versus the master template;
+   - the body forwards to the equivalent algorithm from the accelerator specific
+     library
+
+2. It provides allocation / deallocation functions that are equivalent to the
+   standard ones, but obtain memory by invoking
+   `hipMallocManaged <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory_m.html#gab8cfa0e292193fa37e0cc2e4911fa90a>`_
+   and release it via `hipFree <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html#ga740d08da65cae1441ba32f8fedb863d1>`_.
+
+Restrictions
+============
+
+We define two modes in which runtime execution can occur:
+
+1. **HMM Mode** - this assumes that the
+   `HMM <https://docs.kernel.org/mm/hmm.html>`_ subsystem of the Linux kernel
+   is used to provide transparent on-demand paging i.e. memory obtained from a
+   system / OS allocator such as via a call to ``malloc`` or ``operator new`` is
+   directly accessible to the accelerator and it follows the C++ memory model;
+2. **Interposition Mode** - this is a fallback mode for cases where transparent
+   on-demand paging is unavailable (e.g. in the Windows OS), which means that
+   memory must be allocated via an accelerator aware mechanism, and system
+   allocated memory is inaccessible for the accelerator.
+
+The following restrictions imposed on user code apply to both modes:
+
+1. Pointers to function, and all associated features, such as e.g. dynamic
+   polymorphism, cannot be used (directly or transitively) by the user provided
+   callable passed to an algorithm invocation;
+2. Global / namespace scope / ``static`` / ``thread`` storage duration variables
+   cannot be used (directly or transitively) in name by the user provided
+   callable;
+
+   - When executing in **HMM Mode** they can be used in address e.g.:
+
+     .. code-block:: C++
+
+        namespace { int foo = 42; }
+
+        bool never(const vector<int>& v) {
+          return any_of(execution::par_unseq, cbegin(v), cend(v), [](auto&& x) {
+            return x == foo;
+          });
+        }
+
+        bool only_in_hmm_mode(const vector<int>& v) {
+          return any_of(execution::par_unseq, cbegin(v), cend(v),
+                        [p = &foo](auto&& x) { return x == *p; });
+        }
+
+3. Only algorithms that are invoked with the ``parallel_unsequenced_policy`` are
+   candidates for offload;
+4. Only algorithms that are invoked with iterator arguments that model
+   `random_access_iterator <https://en.cppreference.com/w/cpp/iterator/random_access_iterator>`_
+   are candidates for offload;
+5. `Exceptions <https://en.cppreference.com/w/cpp/language/exceptions>`_ cannot
+   be used by the user provided callable;
+6. Dynamic memory allocation (e.g. ``operator new``) cannot be used by the user
+   provided callable;
+7. Selective offload is not possible i.e. it is not possible to indicate that
+   only some algorithms invoked with the ``parallel_unsequenced_policy`` are to
+   be executed on the accelerator.
+
+In addition to the above, using **Interposition Mode** imposes the following
+additional restrictions:
+
+1. All code that is expected to interoperate has to be recompiled with the
+   ``--stdpar-interpose-alloc`` flag i.e. it is not safe to compose libraries
+   that have been independently compiled;
+2. automatic storage duration (i.e. stack allocated) variables cannot be used
+   (directly or transitively) by the user provided callable e.g.
+
+   .. code-block:: c++
+
+      bool never(const vector<int>& v, int n) {
+        return any_of(execution::par_unseq, cbegin(v), cend(v),
+                      [p = &n](auto&& x) { return x == *p; });
+      }
+
+Current Support
+===============
+
+At the moment, C++ Standard Parallelism Offload is only available for AMD GPUs,
+when the `ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack is used, on the
+Linux operating system. Whilst the design outlined above is generic and target
+independent, only the above combination has been validated. In the future, as
+other accelerators that can be targeted via HIP are validated, and if they
+choose to implement a forwarding header (or contribute to the existing one),
+support will be extended.
+
+Focusing on AMD GPU targets, support is synthesised in the following table
+
+.. list-table::
+   :header-rows: 1
+
+   * - `Processor <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
+     - HMM Mode
+     - Interposition Mode
+   * - GCN GFX9 (Vega)
+     - YES
+     - YES
+   * - GCN GFX10.1 (RDNA 1)
+     - *NO*
+     - YES
+   * - GCN GFX10.3 (RDNA 2)
+     - *NO*
+     - YES
+   * - GCN GFX11 (RDNA 3)
+     - *NO*
+     - YES
+
+The minimum Linux kernel version for running in HMM mode is 6.4.
+
+The forwarding header can be obtained from
+`its GitHub repository <https://github.com/ROCmSoftwarePlatform/roc-stdpar>`_.
+It will be packaged with a future `ROCm <https://rocm.docs.amd.com/en/latest/>`_
+release. Because accelerated algorithms are provided via
+`rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_, a
+transitive dependency on
+`rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ exists. Both
+can be obtained either by installing their associated components of the
+`ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack, or from their respective
+repositories. The list algorithms that can be offloaded is available
+`here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#algorithm-support-status>`_.
+
+HIP Specific Elements
+---------------------
+
+Whilst the support for C++ Standard Parallelism Offload is generic, and not
+defined in terms of the HIP language or HIP runtime APIs, there are consequences
+to using the latter in the implementation. We enumerate a few which are likely
+to be relevant to users:
+
+1. There is no defined interop with the
+   `HIP kernel language <https://rocm.docs.amd.com/projects/HIP/en/latest/reference/kernel_language.html>`_;
+   whilst things like using `__device__` annotations might accidentally "work",
+   they are not guaranteed to, and thus cannot be relied upon by user code;
+2. Combining explicit HIP compilation with ``--stdpar`` based offloading is not
+   allowed or supported in any way.
+3. There is no way to target different accelerators via a standard algorithm
+   invocation (`this might be addressed in future C++ standards <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2500r1.html>`_);
+   an unsafe (per the point above) way of achieving this is to spawn new threads
+   and invoke the `hipSetDevice <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___device.html#ga43c1e7f15925eeb762195ccb5e063eae>`_
+   interface e.g.:
+
+   .. code-block:: c++
+
+      int accelerator_0 = ...;
+      int accelerator_1 = ...;
+
+      bool multiple_accelerators(const vector<int>& u, const vector<int>& v) {
+        atomic<unsigned int> r{0u};
+
+        thread t0{[&]() {
+          hipSetDevice(accelerator_0);
+
+          r += count(execution::par_unseq, cbegin(u), cend(u), 42);
+        }};
+        thread t1{[&]() {
+          hitSetDevice(accelerator_1);
+
+          r += count(execution::par_unseq, cbegin(v), cend(v), 314152)
+        }};
+
+        t0.join();
+        t1.join();
+
+        return r;
+      }
+
+   Note that this is a temporary, unsafe workaround for a deficiency in the C++
+   Standard.
+
+Open Questions / Future Developments
+====================================
+
+1. The restriction on the use of global / namespace scope / ``static`` /
+   ``thread`` storage duration variables in offloaded algorithms will be lifted
+   in the future, when running in **HMM Mode**;
+2. The restriction on the use of dynamic memory allocation in offloaded
+   algorithms will be lifted in the future.
+3. The restriction on the use of pointers to function, and associated features
+   such as dynamic polymorphism might be lifted in the future, when running in
+   **HMM Mode**;
+4. Offload support might be extended to cases where the ``parallel_policy`` is
+   used for some or all targets.

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D155769: [Clang][docs][RFC] Add documentation for C++ Parallel Algorithm Offload

Reply via email to