[PR] fix(server): support dedicated backend data structures and serialization logic for vector index. [incubator-hugegraph]

via GitHub Sun, 23 Nov 2025 03:47:22 -0800


hahahahbenny opened a new pull request, #2913:
URL: https://github.com/apache/incubator-hugegraph/pull/2913


   ## Purpose of the PR
   
   This PR adds the dedicated backend data structures and serialization logic 
needed to support vector indexing in HugeGraph.
   
   
   ## Main Changes
   
   1. **In-Memory Data Structure**
      ```java
      HugeVectorIndexMap
      ```
      - Represents a single vector-index entry at runtime.  
      - Carries `sequence` (offset) and `IndexVectorState` (dirty flag, 
metadata).
   
   2. **Type-System Extensions**
      | Enum | New Constant | Purpose |
      |---|---|---|
      | `IndexType` | `VECTOR` | Top-level kind for vector indices. |
      | `HugeType` | `VECTOR_INDEX_MAP` | Identifies the per-entry record. |
      | `HugeType` | `VECTOR_SEQUENCE` | Identifies the global sequence counter. |
   
   2.1 Column family design
   
   ### VECTOR_INDEX_MAP
   
   - `elemId`: ID of the vertex being indexed.  
   - `sequence`: `long`.  
   - `vectorStateCode = IndexVectorState.code()`: state of the vector index 
(e.g., BUILDING / FLUSHED / DELETING).
   
   | Item | Design |
   | ---- | ---- |
   | Key | `[1B type][4B indexId][4B vectorId][4B elemId][(optional) VLong expiredTime]` |
   | Value | Fixed 9 bytes: `[8B sequence][1B vectorStateCode]` |
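
The fixed 9-byte value above can be illustrated with a minimal sketch. The class `VectorIndexValueCodec` and its methods are hypothetical names chosen for this example and are not part of this PR. Big-endian byte order (the `ByteBuffer` default) is an assumption, since the table does not state the byte order.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not from the PR): packs and unpacks the 9-byte value. */
final class VectorIndexValueCodec {

    /** 8 bytes of sequence followed by 1 byte of state code. */
    static final int VALUE_LENGTH = Long.BYTES + Byte.BYTES;

    private VectorIndexValueCodec() {
    }

    /** Layout: [8B sequence][1B vectorStateCode]; big-endian is assumed. */
    static byte[] encode(long sequence, byte vectorStateCode) {
        return ByteBuffer.allocate(VALUE_LENGTH)
                         .putLong(sequence)
                         .put(vectorStateCode)
                         .array();
    }

    /** Reads the sequence from the first 8 bytes. */
    static long sequenceOf(byte[] value) {
        checkLength(value);
        return ByteBuffer.wrap(value).getLong(0);
    }

    /** Reads the state code from the last byte. */
    static byte stateCodeOf(byte[] value) {
        checkLength(value);
        return value[Long.BYTES];
    }

    private static void checkLength(byte[] value) {
        if (value.length != VALUE_LENGTH) {
            throw new IllegalArgumentException(
                    "Expected " + VALUE_LENGTH + " bytes, got " + value.length);
        }
    }
}
```

Because the value has a fixed width, a reader can reject truncated or oversized records before decoding them, which is what `checkLength` does in this sketch.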
   
   ### VECTOR_SEQUENCE
   
   - `vectorId` is encoded as an `int`.
   
   | Item | Design |
   | ---- | ---- |
   | Key | `[1B dirty_prefix][4B indexId][8B sequence][4B vectorId]` |
   | Value | `sequence` (8B `long`) + `state` (1B `IndexVectorState.code()`) |
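
As a rough illustration of the 17-byte key above, the following sketch encodes the four fields in order and compares two keys as unsigned bytes. `VectorSequenceKeyCodec` is a hypothetical name, the `dirty_prefix` value is passed in because the PR text does not state it, and big-endian encoding is an assumption. Under that assumption, keys that share a prefix and `indexId` sort by non-negative `sequence` under unsigned bytewise comparison.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not from the PR): builds the VECTOR_SEQUENCE key. */
final class VectorSequenceKeyCodec {

    private static final int SEQUENCE_OFFSET = Byte.BYTES + Integer.BYTES;    // 5
    private static final int VECTOR_ID_OFFSET = SEQUENCE_OFFSET + Long.BYTES; // 13
    static final int KEY_LENGTH = VECTOR_ID_OFFSET + Integer.BYTES;           // 17

    private VectorSequenceKeyCodec() {
    }

    /** Layout: [1B dirty_prefix][4B indexId][8B sequence][4B vectorId]. */
    static byte[] encode(byte dirtyPrefix, int indexId, long sequence, int vectorId) {
        return ByteBuffer.allocate(KEY_LENGTH)
                         .put(dirtyPrefix)
                         .putInt(indexId)
                         .putLong(sequence)
                         .putInt(vectorId)
                         .array();
    }

    static long sequenceOf(byte[] key) {
        return ByteBuffer.wrap(key).getLong(SEQUENCE_OFFSET);
    }

    static int vectorIdOf(byte[] key) {
        return ByteBuffer.wrap(key).getInt(VECTOR_ID_OFFSET);
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Placing `sequence` before `vectorId` means a forward scan over one index would visit entries in sequence order, provided the assumptions above hold for the real serializer.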
   
   2.2 Vector index state machine
   
   ```mermaid
   stateDiagram-v2
       [*] --> BUILDING: user writes vector<br/>GraphTransaction.commit()
   
       BUILDING --> FLUSHED: VectorIndexManager consumes<br/>and flushes to snapshot
   
       FLUSHED --> BUILDING: user modifies vector<br/>update operation
   
       FLUSHED --> DELETING: user deletes vector<br/>delete operation
   
       BUILDING --> DELETING: vector under construction deleted<br/>delete operation
   
       DELETING --> BUILDING: deleted vector re-written<br/>write operation
   
       DELETING --> [*]: VectorIndexManager consumes deletion<br/>physically purged from RocksDB
   ```
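
The transitions in the diagram can be written down as a small transition table. The sketch below is illustrative only: `VectorStateSketch` is a hypothetical stand-in for `IndexVectorState`, whose actual API and state codes are not shown in this description. The final `DELETING --> [*]` edge is modelled as a purge check rather than as a fourth state.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical stand-in for IndexVectorState; mirrors the diagram only. */
enum VectorStateSketch {
    BUILDING, FLUSHED, DELETING;

    /** States reachable from this one according to the diagram. */
    Set<VectorStateSketch> nextStates() {
        switch (this) {
            case BUILDING:
                return EnumSet.of(FLUSHED, DELETING);
            case FLUSHED:
                return EnumSet.of(BUILDING, DELETING);
            case DELETING:
                return EnumSet.of(BUILDING);
            default:
                throw new AssertionError("Unknown state: " + this);
        }
    }

    boolean canTransitionTo(VectorStateSketch next) {
        return nextStates().contains(next);
    }

    /** Returns the next state, or throws if the diagram has no such edge. */
    VectorStateSketch transitionTo(VectorStateSketch next) {
        if (!canTransitionTo(next)) {
            throw new IllegalStateException(this + " -> " + next + " is not allowed");
        }
        return next;
    }

    /** Only entries in DELETING may be physically purged. */
    boolean canBePurged() {
        return this == DELETING;
    }
}
```

For example, `VectorStateSketch.FLUSHED.transitionTo(VectorStateSketch.DELETING)` succeeds, while `DELETING` to `FLUSHED` throws, since a deleted vector has to be re-written (back to `BUILDING`) before it can be flushed again.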
   
   3. **On-Disk Binary Layout**  
      Serializer entry points:
      ```java
      BinarySerializer#writeIndex(HugeVectorIndexMap)
      BinarySerializer#writeVectorSequence(...)
      BinarySerializer#readVectorSequence(...)
      ```
      Target column families:
      - `cf_vector_index_map` – stores the state of each vector index entry.  
      - `cf_vector_seq_index` – stores monotonically increasing sequence IDs.
   
   4. **Test Coverage**
      ```java
      VectorIndexSerializerTest
      ```
      Locks down the byte-level format for:
      - sequence id  
      - dirty marker id  
      - vector index value  
      - sequence entry record  
      These tests guard against accidental format changes and so preserve backward compatibility.
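
A byte-level format test usually compares serializer output against hard-coded "golden" bytes. The following sketch shows that idea in plain Java without a test framework. `GoldenBytesCheck` and `serializeValue` are hypothetical and merely stand in for the real serializer; they are not the contents of `VectorIndexSerializerTest`. The golden bytes assume a big-endian `[8B sequence][1B vectorStateCode]` value.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical golden-bytes check; not the PR's actual test class. */
final class GoldenBytesCheck {

    /** Expected bytes for sequence = 1 and state code = 2 (assumed layout). */
    static final byte[] GOLDEN = {0, 0, 0, 0, 0, 0, 0, 1, 2};

    private GoldenBytesCheck() {
    }

    /** Stand-in for the serializer under test. */
    static byte[] serializeValue(long sequence, byte stateCode) {
        return ByteBuffer.allocate(Long.BYTES + Byte.BYTES)
                         .putLong(sequence)
                         .put(stateCode)
                         .array();
    }

    /** True if the current output still matches the recorded format. */
    static boolean matchesGolden() {
        return Arrays.equals(GOLDEN, serializeValue(1L, (byte) 2));
    }

    /** Hex dump, useful in failure messages when the format drifts. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }
}
```

If a later change reorders or resizes a field, `matchesGolden()` returns `false`, and the hex dump shows exactly which bytes moved.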
   
   ## Verifying these changes
   
   <!-- Please pick the proper options below -->
   
   - [ ] Trivial rework / code cleanup without any test coverage. (No Need)
   - [x] Already covered by existing tests, such as *(please modify tests 
here)*.
   - [ ] Need tests and can be verified as follows:
       - xxx
   
   ## Does this PR potentially affect the following parts?
   
   <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
   
   - [ ]  Dependencies ([add/update license](https://hugegraph.apache.org/docs/contribution-guidelines/contribute/#321-check-licenses) info & [regenerate_known_dependencies.sh](../install-dist/scripts/dependency/regenerate_known_dependencies.sh)) <!-- Don't forget to add/update the info in "LICENSE" & "NOTICE" files (both in root & dist module) -->
   - [ ]  Modify configurations
   - [ ]  The public API
   - [x]  Other effects (new `HugeType` constants, new data structure)
   - [ ]  Nope
   
   
   ## Documentation Status
   
   <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
   
   - [ ]  `Doc - TODO` <!-- Your PR changes impact docs and you will update 
later -->
   - [x]  `Doc - Done` <!-- Related docs have been already added or updated -->
   - [ ]  `Doc - No Need` <!-- Your PR changes don't impact/need docs -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

