[llvm-branch-commits] [llvm] [IR2Vec] Refactor vocabulary to use canonical type IDs (PR #155323)

Mircea Trofin via llvm-branch-commits Wed, 27 Aug 2025 06:51:03 -0700

================
@@ -137,13 +138,48 @@ using InstEmbeddingsMap = DenseMap<const Instruction *, 
Embedding>;
 using BBEmbeddingsMap = DenseMap<const BasicBlock *, Embedding>;
 
 /// Class for storing and accessing the IR2Vec vocabulary.
-/// Encapsulates all vocabulary-related constants, logic, and access methods.
+///
+/// The Vocabulary class manages seed embeddings for LLVM IR entities. It
+/// contains the seed embeddings for three types of entities: instruction
+/// opcodes, types, and operands. Types are grouped/canonicalized for better
+/// learning (e.g., all float variants map to FloatTy). The vocabulary 
abstracts
+/// away the canonicalization effectively, the exposed APIs handle all the 
known
+/// LLVM IR opcodes, types and operands.
+///
+/// This class helps populate the seed embeddings in an internal vector-based
+/// ADT. It provides logic to map every IR entity to a specific slot index or
+/// position in this vector, enabling O(1) embedding lookup while avoiding
+/// unnecessary computations involving string based lookups while generating 
the
+/// embeddings.
 class Vocabulary {
   friend class llvm::IR2VecVocabAnalysis;
   using VocabVector = std::vector<ir2vec::Embedding>;
   VocabVector Vocab;
   bool Valid = false;
 
+public:
+  // Slot layout:
----------------
mtrofin wrote:


Not for this patch, but curious (and if it makes sense, it can be addressed 
later): why not 3 vectors rather than indexing within one? Also, IIUC the tools 
have some understanding that this is the case (there's one vector with slots) 
and thus need to use getSlotIndex. Is this an artifact of how serialization 
happens? Maybe that can be captured in the comment here (to be clear, I don't 
think 3 vectors vs one is better or worse, just a bit of a confusing design 
choice so a comment may help)

https://github.com/llvm/llvm-project/pull/155323
_______________________________________________
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits

[llvm-branch-commits] [llvm] [IR2Vec] Refactor vocabulary to use canonical type IDs (PR #155323)

Reply via email to