llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT--> @llvm/pr-subscribers-mlgo Author: S. VenkataKeerthy (svkeerthy) <details> <summary>Changes</summary> Add MIR2Vec support to the llvm-ir2vec tool, enabling embedding generation for Machine IR alongside the existing LLVM IR functionality. --- Patch is 35.27 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/164025.diff 7 Files Affected: - (modified) llvm/docs/CommandGuide/llvm-ir2vec.rst (+76-19) - (modified) llvm/include/llvm/CodeGen/MIR2Vec.h (+50-7) - (modified) llvm/lib/CodeGen/MIR2Vec.cpp (+36-55) - (added) llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir (+92) - (added) llvm/test/tools/llvm-ir2vec/error-handling.mir (+41) - (modified) llvm/tools/llvm-ir2vec/CMakeLists.txt (+14) - (modified) llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp (+253-35) ``````````diff diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst index fc590a6180316..55fe75d2084b1 100644 --- a/llvm/docs/CommandGuide/llvm-ir2vec.rst +++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst @@ -1,5 +1,5 @@ -llvm-ir2vec - IR2Vec Embedding Generation Tool -============================================== +llvm-ir2vec - IR2Vec and MIR2Vec Embedding Generation Tool +=========================================================== .. program:: llvm-ir2vec @@ -11,9 +11,9 @@ SYNOPSIS DESCRIPTION ----------- -:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It -generates IR2Vec embeddings for LLVM IR and supports triplet generation -for vocabulary training. +:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec and MIR2Vec. +It generates embeddings for both LLVM IR and Machine IR (MIR) and supports +triplet generation for vocabulary training. The tool provides three main subcommands: @@ -23,23 +23,33 @@ The tool provides three main subcommands: 2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary training. -3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary +3. **embeddings**: Generates IR2Vec or MIR2Vec embeddings using a trained vocabulary at different granularity levels (instruction, basic block, or function). +The tool supports two operation modes: + +* **LLVM IR mode** (``--mode=llvm``): Process LLVM IR bitcode files and generate + IR2Vec embeddings +* **Machine IR mode** (``--mode=mir``): Process Machine IR (.mir) files and generate + MIR2Vec embeddings + The tool is designed to facilitate machine learning applications that work with -LLVM IR by converting the IR into numerical representations that can be used by -ML models. The `triplets` subcommand generates numeric IDs directly instead of string -triplets, streamlining the training data preparation workflow. +LLVM IR or Machine IR by converting them into numerical representations that can +be used by ML models. The `triplets` subcommand generates numeric IDs directly +instead of string triplets, streamlining the training data preparation workflow. .. note:: - For information about using IR2Vec programmatically within LLVM passes and - the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_ + For information about using IR2Vec and MIR2Vec programmatically within LLVM + passes and the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_ section in the MLGO documentation. OPERATION MODES --------------- +The tool operates in two modes: **LLVM IR mode** and **Machine IR mode**. The mode +is selected using the ``--mode`` option (default: ``llvm``). + Triplet Generation and Entity Mapping Modes are used for preparing vocabulary and training data for knowledge graph embeddings. The Embedding Mode is used for generating embeddings from LLVM IR using a pre-trained vocabulary. @@ -89,18 +99,31 @@ Embedding Generation ~~~~~~~~~~~~~~~~~~~~ With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to -generate numerical embeddings for LLVM IR at different levels of granularity. +generate numerical embeddings for LLVM IR or Machine IR at different levels of granularity. + +Example Usage for LLVM IR: + +.. code-block:: bash + + llvm-ir2vec embeddings --mode=llvm --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt -Example Usage: +Example Usage for Machine IR: .. code-block:: bash - llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt + llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=vocab.json --level=func input.mir -o embeddings.txt OPTIONS ------- -Global options: +Common options (applicable to both LLVM IR and Machine IR modes): + +.. option:: --mode=<mode> + + Specify the operation mode. Valid values are: + + * ``llvm`` - Process LLVM IR bitcode files (default) + * ``mir`` - Process Machine IR (.mir) files .. option:: -o <filename> @@ -116,8 +139,8 @@ Subcommand-specific options: .. option:: <input-file> - The input LLVM IR or bitcode file to process. This positional argument is - required for the `embeddings` subcommand. + The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process. + This positional argument is required for the `embeddings` subcommand. .. option:: --level=<level> @@ -131,6 +154,8 @@ Subcommand-specific options: Process only the specified function instead of all functions in the module. +**IR2Vec-specific options** (for ``--mode=llvm``): + .. option:: --ir2vec-kind=<kind> Specify the kind of IR2Vec embeddings to generate. Valid values are: @@ -143,8 +168,8 @@ Subcommand-specific options: .. option:: --ir2vec-vocab-path=<path> - Specify the path to the vocabulary file (required for embedding generation). - The vocabulary file should be in JSON format and contain the trained + Specify the path to the IR2Vec vocabulary file (required for LLVM IR embedding + generation). The vocabulary file should be in JSON format and contain the trained vocabulary for embedding generation. See `llvm/lib/Analysis/models` for pre-trained vocabulary files. @@ -163,6 +188,35 @@ Subcommand-specific options: Specify the weight for argument embeddings (default: 0.2). This controls the relative importance of operand information in the final embedding. +**MIR2Vec-specific options** (for ``--mode=mir``): + +.. option:: --mir2vec-vocab-path=<path> + + Specify the path to the MIR2Vec vocabulary file (required for Machine IR + embedding generation). The vocabulary file should be in JSON format and + contain the trained vocabulary for embedding generation. + +.. option:: --mir2vec-kind=<kind> + + Specify the kind of MIR2Vec embeddings to generate. Valid values are: + + * ``symbolic`` - Generate symbolic embeddings (default) + +.. option:: --mir2vec-opc-weight=<weight> + + Specify the weight for machine opcode embeddings (default: 1.0). This controls + the relative importance of machine instruction opcodes in the final embedding. + +.. option:: --mir2vec-common-operand-weight=<weight> + + Specify the weight for common operand embeddings (default: 1.0). This controls + the relative importance of common operand types in the final embedding. + +.. option:: --mir2vec-reg-operand-weight=<weight> + + Specify the weight for register operand embeddings (default: 1.0). This controls + the relative importance of register operands in the final embedding. + **triplets** subcommand: @@ -240,3 +294,6 @@ SEE ALSO For more information about the IR2Vec algorithm and approach, see: `IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_. + +For more information about the MIR2Vec algorithm and approach, see: +`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_. diff --git a/llvm/include/llvm/CodeGen/MIR2Vec.h b/llvm/include/llvm/CodeGen/MIR2Vec.h index 953e590a6d64f..f47d9abb042d8 100644 --- a/llvm/include/llvm/CodeGen/MIR2Vec.h +++ b/llvm/include/llvm/CodeGen/MIR2Vec.h @@ -7,9 +7,20 @@ //===----------------------------------------------------------------------===// /// /// \file -/// This file defines the MIR2Vec vocabulary -/// analysis(MIR2VecVocabLegacyAnalysis), the core mir2vec::MIREmbedder -/// interface for generating Machine IR embeddings, and related utilities. +/// This file defines the MIR2Vec framework for generating Machine IR +/// embeddings. +/// +/// Architecture Overview: +/// ---------------------- +/// 1. MIR2VecVocabProvider - Core vocabulary loading logic (no PM dependency) +/// - Can be used standalone or wrapped by the pass manager +/// - Requires MachineModuleInfo with parsed machine functions +/// +/// 2. MIR2VecVocabLegacyAnalysis - Pass manager wrapper (ImmutablePass) +/// - Integrated and used by llc -print-mir2vec +/// +/// 3. MIREmbedder - Generates embeddings from vocabulary +/// - SymbolicMIREmbedder: MIR2Vec embedding implementation /// /// MIR2Vec extends IR2Vec to support Machine IR embeddings. It represents the /// LLVM Machine IR as embeddings which can be used as input to machine learning @@ -306,26 +317,58 @@ class SymbolicMIREmbedder : public MIREmbedder { } // namespace mir2vec +/// MIR2Vec vocabulary provider used by pass managers and standalone tools. +/// This class encapsulates the core vocabulary loading logic and can be used +/// independently of the pass manager infrastructure. For pass-based usage, +/// see MIR2VecVocabLegacyAnalysis. +/// +/// Note: This provider pattern makes new PM migration straightforward when +/// needed. A new PM analysis wrapper can be added that delegates to this +/// provider, similar to how MIR2VecVocabLegacyAnalysis currently wraps it. +class MIR2VecVocabProvider { + using VocabMap = std::map<std::string, mir2vec::Embedding>; + +public: + MIR2VecVocabProvider(const MachineModuleInfo &MMI) : MMI(MMI) {} + + Expected<mir2vec::MIRVocabulary> getVocabulary(const Module &M); + +private: + Error readVocabulary(VocabMap &OpcVocab, VocabMap &CommonOperandVocab, + VocabMap &PhyRegVocabMap, VocabMap &VirtRegVocabMap); + const MachineModuleInfo &MMI; +}; + /// Pass to analyze and populate MIR2Vec vocabulary from a module class MIR2VecVocabLegacyAnalysis : public ImmutablePass { using VocabVector = std::vector<mir2vec::Embedding>; using VocabMap = std::map<std::string, mir2vec::Embedding>; - std::optional<mir2vec::MIRVocabulary> Vocab; StringRef getPassName() const override; - Error readVocabulary(VocabMap &OpcVocab, VocabMap &CommonOperandVocab, - VocabMap &PhyRegVocabMap, VocabMap &VirtRegVocabMap); protected: void getAnalysisUsage(AnalysisUsage &AU) const override { AU.addRequired<MachineModuleInfoWrapperPass>(); AU.setPreservesAll(); } + std::unique_ptr<MIR2VecVocabProvider> Provider; public: static char ID; MIR2VecVocabLegacyAnalysis() : ImmutablePass(ID) {} - Expected<mir2vec::MIRVocabulary> getMIR2VecVocabulary(const Module &M); + + Expected<mir2vec::MIRVocabulary> getMIR2VecVocabulary(const Module &M) { + MachineModuleInfo &MMI = + getAnalysis<MachineModuleInfoWrapperPass>().getMMI(); + if (!Provider) + Provider = std::make_unique<MIR2VecVocabProvider>(MMI); + return Provider->getVocabulary(M); + } + + MIR2VecVocabProvider &getProvider() { + assert(Provider && "Provider not initialized"); + return *Provider; + } }; /// This pass prints the embeddings in the MIR2Vec vocabulary diff --git a/llvm/lib/CodeGen/MIR2Vec.cpp b/llvm/lib/CodeGen/MIR2Vec.cpp index 716221101af9f..69c1e28e55e3b 100644 --- a/llvm/lib/CodeGen/MIR2Vec.cpp +++ b/llvm/lib/CodeGen/MIR2Vec.cpp @@ -412,24 +412,39 @@ Expected<MIRVocabulary> MIRVocabulary::createDummyVocabForTest( } //===----------------------------------------------------------------------===// -// MIR2VecVocabLegacyAnalysis Implementation +// MIR2VecVocabProvider and MIR2VecVocabLegacyAnalysis //===----------------------------------------------------------------------===// -char MIR2VecVocabLegacyAnalysis::ID = 0; -INITIALIZE_PASS_BEGIN(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis", - "MIR2Vec Vocabulary Analysis", false, true) -INITIALIZE_PASS_DEPENDENCY(MachineModuleInfoWrapperPass) -INITIALIZE_PASS_END(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis", - "MIR2Vec Vocabulary Analysis", false, true) +Expected<mir2vec::MIRVocabulary> +MIR2VecVocabProvider::getVocabulary(const Module &M) { + VocabMap OpcVocab, CommonOperandVocab, PhyRegVocabMap, VirtRegVocabMap; -StringRef MIR2VecVocabLegacyAnalysis::getPassName() const { - return "MIR2Vec Vocabulary Analysis"; + if (Error Err = readVocabulary(OpcVocab, CommonOperandVocab, PhyRegVocabMap, + VirtRegVocabMap)) + return std::move(Err); + + for (const auto &F : M) { + if (F.isDeclaration()) + continue; + + if (auto *MF = MMI.getMachineFunction(F)) { + auto &Subtarget = MF->getSubtarget(); + if (const auto *TII = Subtarget.getInstrInfo()) + if (const auto *TRI = Subtarget.getRegisterInfo()) + return mir2vec::MIRVocabulary::create( + std::move(OpcVocab), std::move(CommonOperandVocab), + std::move(PhyRegVocabMap), std::move(VirtRegVocabMap), *TII, *TRI, + MF->getRegInfo()); + } + } + return createStringError(errc::invalid_argument, + "No machine functions found in module"); } -Error MIR2VecVocabLegacyAnalysis::readVocabulary(VocabMap &OpcodeVocab, - VocabMap &CommonOperandVocab, - VocabMap &PhyRegVocabMap, - VocabMap &VirtRegVocabMap) { +Error MIR2VecVocabProvider::readVocabulary(VocabMap &OpcodeVocab, + VocabMap &CommonOperandVocab, + VocabMap &PhyRegVocabMap, + VocabMap &VirtRegVocabMap) { if (VocabFile.empty()) return createStringError( errc::invalid_argument, @@ -478,49 +493,15 @@ Error MIR2VecVocabLegacyAnalysis::readVocabulary(VocabMap &OpcodeVocab, return Error::success(); } -Expected<mir2vec::MIRVocabulary> -MIR2VecVocabLegacyAnalysis::getMIR2VecVocabulary(const Module &M) { - if (Vocab.has_value()) - return std::move(Vocab.value()); - - VocabMap OpcMap, CommonOperandMap, PhyRegMap, VirtRegMap; - if (Error Err = - readVocabulary(OpcMap, CommonOperandMap, PhyRegMap, VirtRegMap)) - return std::move(Err); - - // Get machine module info to access machine functions and target info - MachineModuleInfo &MMI = getAnalysis<MachineModuleInfoWrapperPass>().getMMI(); - - // Find first available machine function to get target instruction info - for (const auto &F : M) { - if (F.isDeclaration()) - continue; - - if (auto *MF = MMI.getMachineFunction(F)) { - auto &Subtarget = MF->getSubtarget(); - const TargetInstrInfo *TII = Subtarget.getInstrInfo(); - if (!TII) { - return createStringError(errc::invalid_argument, - "No TargetInstrInfo available; cannot create " - "MIR2Vec vocabulary"); - } - - const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo(); - if (!TRI) { - return createStringError(errc::invalid_argument, - "No TargetRegisterInfo available; cannot " - "create MIR2Vec vocabulary"); - } - - return mir2vec::MIRVocabulary::create( - std::move(OpcMap), std::move(CommonOperandMap), std::move(PhyRegMap), - std::move(VirtRegMap), *TII, *TRI, MF->getRegInfo()); - } - } +char MIR2VecVocabLegacyAnalysis::ID = 0; +INITIALIZE_PASS_BEGIN(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis", + "MIR2Vec Vocabulary Analysis", false, true) +INITIALIZE_PASS_DEPENDENCY(MachineModuleInfoWrapperPass) +INITIALIZE_PASS_END(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis", + "MIR2Vec Vocabulary Analysis", false, true) - // No machine functions available - return error - return createStringError(errc::invalid_argument, - "No machine functions found in module"); +StringRef MIR2VecVocabLegacyAnalysis::getPassName() const { + return "MIR2Vec Vocabulary Analysis"; } //===----------------------------------------------------------------------===// diff --git a/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir b/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir new file mode 100644 index 0000000000000..e5f78bfd2090e --- /dev/null +++ b/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir @@ -0,0 +1,92 @@ +# REQUIRES: x86_64-linux +# RUN: llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT +# RUN: llvm-ir2vec embeddings --mode=mir --level=func --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL +# RUN: llvm-ir2vec embeddings --mode=mir --level=func --function=add_function --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ADD +# RUN: not llvm-ir2vec embeddings --mode=mir --level=func --function=missing_function --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-MISSING +# RUN: llvm-ir2vec embeddings --mode=mir --level=bb --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL +# RUN: llvm-ir2vec embeddings --mode=mir --level=inst --function=add_function --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL + +--- | + target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128" + target triple = "x86_64-unknown-linux-gnu" + + define dso_local noundef i32 @add_function(i32 noundef %a, i32 noundef %b) { + entry: + %sum = add nsw i32 %a, %b + %result = mul nsw i32 %sum, 2 + ret i32 %result + } + + define dso_local void @simple_function() { + entry: + ret void + } +... +--- +name: add_function +alignment: 16 +tracksRegLiveness: true +registers: + - { id: 0, class: gr32 } + - { id: 1, class: gr32 } + - { id: 2, class: gr32 } + - { id: 3, class: gr32 } +liveins: + - { reg: '$edi', virtual-reg: '%0' } + - { reg: '$esi', virtual-reg: '%1' } +body: | + bb.0.entry: + liveins: $edi, $esi + + %1:gr32 = COPY $esi + %0:gr32 = COPY $edi + %2:gr32 = nsw ADD32rr %0, %1, implicit-def dead $eflags + %3:gr32 = ADD32rr %2, %2, implicit-def dead $eflags + $eax = COPY %3 + RET 0, $eax + +--- +name: simple_function +alignment: 16 +tracksRegLiveness: true +body: | + bb.0.entry: + RET 0 + +# CHECK-DEFAULT: MIR2Vec embeddings for machine function add_function: +# CHECK-DEFAULT-NEXT: Function vector: [ 26.50 27.10 27.70 ] +# CHECK-DEFAULT: MIR2Vec embeddings for machine function simple_function: +# CHECK-DEFAULT-NEXT: Function vector: [ 1.10 1.20 1.30 ] + +# CHECK-FUNC-LEVEL: MIR2Vec embeddings for machine function add_function: +# CHECK-FUNC-LEVEL-NEXT: Function vector: [ 26.50 27.10 27.70 ] +# CHECK-FUNC-LEVEL: MIR2Vec embeddings for machine function simple_function: +# CHECK-FUNC-LEVEL-NEXT: Function vector: [ 1.10 1.20 1.30 ] + +# CHECK-FUNC-LEVEL-ADD: MIR2Vec embeddings for machine function add_function: +# CHECK-FUNC-LEVEL-ADD-NEXT: Function vector: [ 26.50 27.10 27.70 ] +# CHECK-FUNC-LEVEL-ADD-NOT: simple_function + +# CHECK-FUNC-MISSING: Error: Function 'missing_function' not found + +# CHECK-BB-LEVEL: MIR2Vec embeddings for machine function add_function: +# CHECK-BB-LEVEL-NEXT: Basic block vectors: +# CHECK-BB-LEVEL-NEXT: MBB entry: [ 26.50 27.10 27.70 ] +# CHECK-BB-LEVEL: MIR2Vec embeddings for machine function simple_function: +# CHECK-BB-LEVEL-NEXT: Basic block vectors: +# CHECK-BB-LEVEL-NEXT: MBB entry: [ 1.10 1.20 1.30 ] + +# CHECK-INST-LEVEL: MIR2Vec embeddings for machine function add_function: +# CHECK-INST-LEVEL-NEXT: Instruction vectors: +# CHECK-INST-LEVEL: %1:gr32 = COPY $esi +# CHECK-INST-LEVEL-NEXT: -> [ 6.00 6.10 6.20 ] +# CHECK-INST-LEVEL-NEXT: %0:gr32 = COPY $edi +# CHECK-INST-LEVEL-NEXT: -> [ 6.00 6.10 6.20 ] +# CHECK-INST-LEVEL: %2:gr32 = nsw ADD32rr +# CHECK-INST-LEVEL: -> [ 3.70 3.80 3.90 ] +# CHECK-INST-LEVEL: %3:gr32 = ADD32rr +# CHECK-INST-LEVEL: -> [ 3.70 3.80 3.90 ] +# CHECK-INST-LEVEL: $eax = COPY %3:gr32 +# CHECK-INST-LEVEL-NEXT: -> [ 6.00 6.10 6.20 ] +# CHECK-INST-LEVEL: RET 0, $eax +# CHECK-INST-LEVEL-NEXT: -> [ 1.10 1.20 1.30 ] diff --git a/llvm/test/tools/llvm-ir2vec/error-handling.mir b/llvm/test/... [truncated] `````````` </details> https://github.com/llvm/llvm-project/pull/164025 _______________________________________________ llvm-branch-commits mailing list [email protected] https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
