[PR] chore: introduce VectorIndexManager runtime framework with incremental sync, ANN search and versioned persistence [incubator-hugegraph]

via GitHub Fri, 19 Dec 2025 03:34:21 -0800


hahahahbenny opened a new pull request, #2922:
URL: https://github.com/apache/incubator-hugegraph/pull/2922


   
   ## Purpose of the PR
   
   This PR implements the vector-index runtime management framework (`VectorIndexManager`), which coordinates data synchronization between the RocksDB storage layer and the JVector in-memory index, and supports incremental vector updates, ANN search, and index persistence.
   
   
   
   ## Architecture
   
   ### Overall Architecture
   
   ```mermaid
   flowchart TB
       subgraph Manager["VectorIndexManager (Coordinator)"]
           M[VectorIndexManager]
           M --> |init/stop| INIT[Lifecycle Management]
           M --> |signal| SIG[Trigger Incremental Sync]
           M --> |searchVector| SEARCH[Vector Search]
       end
     
       subgraph Components["Three Core Components"]
           RT[VectorRuntime<br/>JVector In-Memory Index]
           SS[VectorStateStore<br/>RocksDB State Storage]
           SC[VectorTaskScheduler<br/>Async Task Scheduler]
       end
     
       M --> RT
       M --> SS
       M --> SC
     
       RT --> |search| JV[JVector HNSW]
       RT --> |flush| DISK[(Disk Persistence)]
       SS --> |scanDeltas| ROCKS[(RocksDB)]
       SS --> |getVertex| ROCKS
       SC --> |execute| EH[EventHub]
   ```
   
   ### Data Flow
   
   ```mermaid
   sequenceDiagram
       participant GIT as GraphIndexTransaction
       participant M as VectorIndexManager
       participant SC as Scheduler
       participant SS as StateStore
       participant RT as Runtime
       participant JV as JVector
   
       Note over GIT,JV: Write Flow (async)
       GIT->>M: signal(indexLabelId)
       M->>SC: execute(task)
       SC->>SS: scanDeltas(indexLabelId, fromSeq)
       SS-->>SC: Iterator VectorRecord
       SC->>RT: update(indexLabelId, records)
       RT->>JV: addGraphNode / markNodeDeleted
   
       Note over GIT,JV: Search Flow (sync)
       GIT->>M: searchVector(indexLabelId, vector, topK)
       M->>RT: search(indexLabelId, vector, topK)
       RT->>JV: GraphSearcher.search()
       JV-->>RT: Iterator vectorId
       RT-->>M: Iterator vectorId
       M->>SS: getVertex(indexLabelId, vectorIds)
       SS-->>M: Set vertexId
       M-->>GIT: Set Id
   ```
   
   ## Main Changes
   
   ### hugegraph-common (abstraction layer)
   
   
   | File                    | Responsibility                                                                  |
   | ----------------------- | ------------------------------------------------------------------------------- |
   | `VectorIndexManager`    | Coordinator, manages lifecycle and interaction of the three components          |
   | `VectorIndexRuntime`    | Runtime interface, defines operations such as update/search/flush               |
   | `AbstractVectorRuntime` | Abstract runtime implementation, manages IndexContext and versioned persistence |
   | `VectorIndexStateStore` | State storage interface, defines scanDeltas/getVertex operations                |
   | `VectorTaskScheduler`   | Task scheduling interface, supports async task execution                        |
   | `VectorRecord`          | Vector record DTO, contains vectorId/vector/deleted/sequence                    |
   
   ### hugegraph-core (server-side implementation)
   
   
   | File                     | Responsibility                                                                |
   | ------------------------ | ----------------------------------------------------------------------------- |
   | `ServerVectorRuntime`    | JVector runtime implementation, supports COSINE/EUCLIDEAN/DOT_PRODUCT         |
   | `ServerVectorStateStore` | RocksDB state storage implementation, scans increments based on IdPrefixQuery |
   | `ServerVectorScheduler`  | Event-driven scheduling implementation based on EventHub                      |
   
   ## Core Design
   
   ### 1. Incremental Sync Mechanism
   
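   Each index tracks a `sequence` watermark and scans only the records written after it (`scanDeltas(indexLabelId, fromSeq)`), so only newly added, modified, or deleted vector records are applied to the JVector in-memory index.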
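The watermark idea can be sketched in plain Java as follows. This is a minimal illustration, not the PR's code: `DeltaRecord`, `InMemoryDeltaLog`, and `WatermarkSync` are hypothetical names, a `TreeMap` stands in for the RocksDB scan, and a plain map stands in for the JVector index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the PR's VectorRecord DTO (vectorId/vector/deleted/sequence).
final class DeltaRecord {
    final int vectorId;
    final float[] vector;
    final boolean deleted;
    final long sequence;

    DeltaRecord(int vectorId, float[] vector, boolean deleted, long sequence) {
        this.vectorId = vectorId;
        this.vector = vector;
        this.deleted = deleted;
        this.sequence = sequence;
    }
}

// Hypothetical in-memory substitute for the RocksDB-backed state store.
class InMemoryDeltaLog {
    private final TreeMap<Long, DeltaRecord> bySequence = new TreeMap<>();

    void append(DeltaRecord record) {
        bySequence.put(record.sequence, record);
    }

    // Returns records whose sequence is strictly greater than fromSeq, in sequence order.
    List<DeltaRecord> scanDeltas(long fromSeq) {
        return new ArrayList<>(bySequence.tailMap(fromSeq, false).values());
    }
}

class WatermarkSync {
    private long watermark = 0L; // highest sequence already applied
    private final Map<Integer, float[]> index = new TreeMap<>(); // stand-in for the ANN index

    // Applies only the records newer than the watermark, then advances the watermark.
    int syncOnce(InMemoryDeltaLog log) {
        List<DeltaRecord> deltas = log.scanDeltas(watermark);
        for (DeltaRecord r : deltas) {
            if (r.deleted) {
                index.remove(r.vectorId);
            } else {
                index.put(r.vectorId, r.vector);
            }
            watermark = r.sequence;
        }
        return deltas.size();
    }

    long watermark() {
        return watermark;
    }

    int liveCount() {
        return index.size();
    }
}
```

Calling `syncOnce` twice in a row without new writes applies nothing the second time, which is the property that makes repeated `signal` calls cheap.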
   
   ### 2. IndexContext Management
   
   Each IndexLabel corresponds to exactly one IndexContext, which encapsulates the vector data, the JVector builder, and the index metadata.
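One common way to keep a single context per index label is a concurrent map keyed by the label ID. The sketch below assumes that approach. `SketchIndexContext` and `ContextRegistry` are hypothetical names, and the real IndexContext also holds the JVector builder, which is omitted here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified context; the JVector builder is left out of this sketch.
class SketchIndexContext {
    final int indexLabelId;
    final int dimension;
    final Map<Integer, float[]> vectors = new ConcurrentHashMap<>();
    volatile long lastAppliedSequence = 0L;

    SketchIndexContext(int indexLabelId, int dimension) {
        this.indexLabelId = indexLabelId;
        this.dimension = dimension;
    }
}

class ContextRegistry {
    private final Map<Integer, SketchIndexContext> contexts = new ConcurrentHashMap<>();

    // Creates the context on first use; later calls return the same instance.
    SketchIndexContext getOrCreate(int indexLabelId, int dimension) {
        return contexts.computeIfAbsent(indexLabelId,
                                        id -> new SketchIndexContext(id, dimension));
    }

    int size() {
        return contexts.size();
    }
}
```

`computeIfAbsent` guarantees that concurrent callers for the same label observe one shared context rather than racing to create two.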
   
   ### 3. Versioned Persistence
   
   Each flush writes a new version directory, and a `current` symbolic link is switched to point at it. Switching the link makes version updates atomic and allows rolling back to an older version.
   
   ```mermaid
   flowchart LR
       subgraph Dir["Directory Structure"]
           BASE["{basePath}/{indexLabelId}/"]
           BASE --> CUR["current → version_xxx (symlink)"]
           BASE --> V1["version_20250101_120000/"]
           BASE --> V2["version_20250101_110000/"]
           V1 --> IDX1["index.inline"]
           V1 --> META1["vector_meta.json"]
       end
   ```
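The link switch can be sketched with `java.nio.file` as below. This is an assumption about how such a switch is typically done, not the PR's implementation: `VersionSwitcher` and the `current.tmp` name are hypothetical, and the sketch relies on `ATOMIC_MOVE` replacing an existing link, which holds on POSIX file systems but is implementation specific elsewhere.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class VersionSwitcher {
    // Points {indexDir}/current at versionName by creating a temporary link
    // and renaming it over the existing one in a single step.
    static void publish(Path indexDir, String versionName) throws IOException {
        Path tmpLink = indexDir.resolve("current.tmp");
        Files.deleteIfExists(tmpLink);
        // A relative target keeps the link valid if the base directory is relocated.
        Files.createSymbolicLink(tmpLink, Paths.get(versionName));
        Files.move(tmpLink, indexDir.resolve("current"), StandardCopyOption.ATOMIC_MOVE);
    }

    // Resolves the directory that the current link points at.
    static Path currentVersion(Path indexDir) throws IOException {
        Path target = Files.readSymbolicLink(indexDir.resolve("current"));
        return indexDir.resolve(target);
    }
}
```

Rolling back is then the same operation with an older directory name, for example `VersionSwitcher.publish(dir, "version_20250101_110000")`.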
   
   ### 4. Soft Delete Strategy
   
   Deletion operations only mark nodes as deleted; actual cleanup occurs during flush.
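A minimal sketch of the mark-then-clean pattern, under simplifying assumptions: `SoftDeleteStore` is a hypothetical class, a `BitSet` tracks deletion marks, and `flush()` compacts by reassigning ordinals, which a real index may well avoid.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class SoftDeleteStore {
    private final List<float[]> vectors = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    // Appends a vector and returns its ordinal.
    int add(float[] vector) {
        vectors.add(vector);
        return vectors.size() - 1;
    }

    // Cheap delete: only sets a bit, the vector stays in storage.
    void markDeleted(int ordinal) {
        deleted.set(ordinal);
    }

    boolean isLive(int ordinal) {
        return ordinal >= 0 && ordinal < vectors.size() && !deleted.get(ordinal);
    }

    int physicalSize() {
        return vectors.size();
    }

    // Rewrites storage without the marked entries and returns how many were removed.
    // Note: ordinals are reassigned here, which is a simplification of this sketch.
    int flush() {
        List<float[]> kept = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) {
            if (!deleted.get(i)) {
                kept.add(vectors.get(i));
            }
        }
        int removed = vectors.size() - kept.size();
        vectors.clear();
        vectors.addAll(kept);
        deleted.clear();
        return removed;
    }
}
```

The trade-off is that deletes stay cheap on the write path, while the cost of reclaiming space is paid once per flush.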
   
   ## Search Flow
   
   Search returns a `Set<Id>` of vertex IDs, which can be used directly to build a `FixedIdHolder` for an `IdHolderList`, so vector search plugs into the existing index query framework.
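The two steps of the search path (rank vectorIds, then translate them to vertex IDs) can be illustrated as follows. This is only a sketch: `SearchSketch` is a hypothetical class, an exact cosine scan replaces JVector's approximate `GraphSearcher`, `String` stands in for HugeGraph's `Id`, and a plain map stands in for the `getVertex` lookup.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SearchSketch {
    // Cosine similarity; assumes equal-length, non-zero vectors.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Step 1: exact top-K by cosine similarity, most similar first.
    static List<Integer> topK(Map<Integer, float[]> vectors, float[] query, int k) {
        List<Integer> ids = new ArrayList<>(vectors.keySet());
        ids.sort(Comparator.comparingDouble((Integer id) -> -cosine(vectors.get(id), query)));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    // Step 2: translate vectorIds into vertex IDs, skipping ids with no mapping.
    static Set<String> toVertexIds(List<Integer> vectorIds, Map<Integer, String> vectorToVertex) {
        Set<String> vertexIds = new LinkedHashSet<>();
        for (Integer vectorId : vectorIds) {
            String vertexId = vectorToVertex.get(vectorId);
            if (vertexId != null) {
                vertexIds.add(vertexId);
            }
        }
        return vertexIds;
    }
}
```

Returning a set of vertex IDs, rather than vectorIds or scores, is what lets the result feed an ID holder without further conversion.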
   
   ## New Dependencies
   
   
   | Dependency | Version | Purpose                          |
   | ---------- | ------- | -------------------------------- |
   | jvector    | 3.0.0   | HNSW vector index implementation |
   
   
   
   ## Follow-up Work
   
   ### To be completed
   
   - [ ]  Integrate into `GraphIndexTransaction.queryByUserprop()` query path
   - [ ]  Implement `doVectorIndex()` method
   - [ ]  REST API / Gremlin Step support for vector search syntax
   - [ ]  Old version file cleanup during `stop()`
   
   ### Tests to be added
   
   
   | Test Type        | Test Content                                      |
   | ---------------- | ------------------------------------------------- |
   | Unit test        | VectorIndexManager lifecycle                      |
   | Unit test        | ServerVectorRuntime incremental update and search |
   | Unit test        | AbstractVectorRuntime versioned persistence       |
   | Integration test | End-to-end search with RocksDB + JVector          |
   | Performance test | Search latency under different vector scales      |
   
   ## Verifying these changes
   
   <!-- Please pick the proper options below -->
   
   - [ ] Trivial rework / code cleanup without any test coverage. (No Need)
   - [ ] Already covered by existing tests, such as *(please modify tests here)*.
   - [X] Need tests and can be verified as follows:
       - xxx
   
   ## Does this PR potentially affect the following parts?
   
   <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
   
   - [x]  Dependencies ([add/update license](https://hugegraph.apache.org/docs/contribution-guidelines/contribute/#321-check-licenses) info & [regenerate_known_dependencies.sh](../install-dist/scripts/dependency/regenerate_known_dependencies.sh)) <!-- Don't forget to add/update the info in "LICENSE" & "NOTICE" files (both in root & dist module) -->
   - [ ]  Modify configurations
   - [ ]  The public API
   - [X]  Other effects (adds a new framework to manage vector indexes)
   - [ ]  Nope
   
   
   ## Documentation Status
   
   <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
   
   - [ ]  `Doc - TODO` <!-- Your PR changes impact docs and you will update later -->
   - [ ]  `Doc - Done` <!-- Related docs have been already added or updated -->
   - [ ]  `Doc - No Need` <!-- Your PR changes don't impact/need docs -->
   
   

