hgaol commented on PR #1510: URL: https://github.com/apache/answer/pull/1510#issuecomment-4230885772
Updated to involve new `vector search` plugin. See demo below. https://github.com/user-attachments/assets/3b44596e-988d-4576-9a6c-c390177aa5b6 Here's summary of the design and implementation. --- # Vector Search & Semantic Search Design ## Architecture Layers ``` ┌──────────────────────────────────────────────┐ │ AI Chat / MCP Tool ("semantic_search") │ ← Controller layer ├──────────────────────────────────────────────┤ │ EmbeddingService │ ← Service layer (thin facade) ├──────────────────────────────────────────────┤ │ plugin.VectorSearch interface │ ← Plugin abstraction ├──────────────────────────────────────────────┤ │ pgvector / elasticsearch / weaviate / ... │ ← Plugin implementations └──────────────────────────────────────────────┘ ``` ## Plugin Interface (`plugin/vector_search.go`) - `RegisterSyncer(ctx, syncer)` — core provides a syncer for bulk data pull - `SearchSimilar(ctx, query, topK)` — returns `[]VectorSearchResult{ObjectID, ObjectType, Metadata, Score}` - `UpdateContent(ctx, content)` — upserts a document with embedding - `DeleteContent(ctx, objectID)` — removes a document - `ConfigReceiver(config)` / `ConfigFields()` — plugin config lifecycle `GenerateEmbedding()` is the shared embedding utility used by plugins. ## Content Syncing (`vector_search_sync/syncer.go`) Core implements `VectorSearchSyncer` with: - `GetQuestionsPage(page, pageSize)` - `GetAnswersPage(page, pageSize)` Each indexed document aggregates question/answer/comment text. Metadata stores deshortened IDs for reconstruction at query time. Sync is triggered from `RegisterSyncer()` (startup + config update flow). ## Startup & Activation Flow ``` initPluginData(): 1. Load plugin status from DB 2. Call ConfigReceiver for configured plugins -> parse config always -> if active: run heavy init (probe embedding + connect/schema checks) -> if inactive: skip heavy init (IsEnabled guard) 3. Call RegisterSyncer for vector search plugins -> if active/initialized: trigger full sync -> if inactive/uninitialized: skip sync ``` On admin config save: 1. `ConfigReceiver` 2. `UpdatePluginConfig` -> `RegisterSyncer` -> full sync ### Current Behavior Summary - **Active plugin on startup**: does one probe embedding call, then full sync to vector storage. - **Inactive plugin on startup**: parses config only; no probe embedding and no sync. - **Config save for active plugin**: re-runs init path and full sync. ## Semantic Search Query Flow ``` User query -> MCP tool "semantic_search" -> EmbeddingService.SearchSimilar(query, topK) -> plugin.SearchSimilar() returns scored IDs + metadata -> handler fetches full DB content (question/answers/comments) -> returns structured semantic search response ``` --- -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
