================ @@ -137,13 +138,48 @@ using InstEmbeddingsMap = DenseMap<const Instruction *, Embedding>; using BBEmbeddingsMap = DenseMap<const BasicBlock *, Embedding>; /// Class for storing and accessing the IR2Vec vocabulary. -/// Encapsulates all vocabulary-related constants, logic, and access methods. +/// +/// The Vocabulary class manages seed embeddings for LLVM IR entities. It +/// contains the seed embeddings for three types of entities: instruction +/// opcodes, types, and operands. Types are grouped/canonicalized for better +/// learning (e.g., all float variants map to FloatTy). The vocabulary abstracts +/// away the canonicalization effectively, the exposed APIs handle all the known +/// LLVM IR opcodes, types and operands. +/// +/// This class helps populate the seed embeddings in an internal vector-based +/// ADT. It provides logic to map every IR entity to a specific slot index or +/// position in this vector, enabling O(1) embedding lookup while avoiding +/// unnecessary computations involving string based lookups while generating the +/// embeddings. class Vocabulary { friend class llvm::IR2VecVocabAnalysis; using VocabVector = std::vector<ir2vec::Embedding>; VocabVector Vocab; bool Valid = false; +public: + // Slot layout: ---------------- mtrofin wrote:
Not for this patch, but curious (and if it makes sense, it can be addressed later): why not 3 vectors rather than indexing within one? Also, IIUC the tools have some understanding that this is the case (there's one vector with slots) and thus need to use getSlotIndex. Is this an artifact of how serialization happens? Maybe that can be captured in the comment here (to be clear, I don't think 3 vectors vs one is better or worse, just a bit of a confusing design choice so a comment may help) https://github.com/llvm/llvm-project/pull/155323 _______________________________________________ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits