hgaol commented on PR #1510: URL: https://github.com/apache/answer/pull/1510#issuecomment-4277236383
> > There are also some follow-ups I don't have clear answers for yet. Posting them here for discussion.
> >
> > ## Follow-up: When to compute embeddings and sync to vector storage
> >
> > ### Comparison with Search Plugin
> >
> > | Aspect | Search Plugin | VectorSearch Plugin |
> > | --- | --- | --- |
> > | Bulk sync | Yes | Yes |
> > | Real-time sync | Yes (create/update/delete hooks) | No |
> > | Trigger | Event-driven + startup/config | Startup/config only |
> > | Consistency | Near real-time | Eventually consistent |
> >
> > ### Current Gap
> >
> > `UpdateContent()` / `DeleteContent()` exist in `plugin.VectorSearch`, but are not called from question/answer service events. So after the initial sync, content changes are not reflected until the next full re-sync. It's good to keep them here, since we'll need them once the syncing solution is finalized.
> >
> > ### Options
> >
> > Below are the 3 options I have in mind. Of course, we could also add a setting to each vector plugin to support all 3 options.
> >
> > 1. **Manual (current)**
> >    * Re-sync only on plugin config save/update
> >    * Simple, but stale results between syncs
> > 2. **Real-time**
> >    * Add event hooks to call vector search update/delete
> >    * Can be async (goroutine / queue) to avoid write-path latency
> >    * Higher embedding API call volume
> > 3. **Scheduled (cron)**
> >    * Periodic bulk sync via cron expression
> >    * Good for off-peak syncing
> >    * Delayed freshness until next run
>
> @hgaol This is great work. Thanks again.
>
> I think your analysis makes sense. We should go with the real-time approach.
>
> In my view, semantic search is much more sensitive to stale data than normal keyword search. If we rely only on manual or scheduled full sync, the vector store can easily drift from the source of truth after question/answer/comment edits, status changes, or deletions.
>
> So I think the better direction is:
>
> 1. keep the existing bulk sync as the bootstrap/rebuild mechanism
> 2. add real-time hooks on create/update/delete/status-change events
> 3. perform vector store updates asynchronously, so we don't add latency to the main write path
>
> This way we get both:
>
> * full sync for initialization/recovery
> * near real-time consistency for day-to-day usage
>
> **In short, I haven't seen you use the `DeleteContent` and `UpdateContent` methods.**
>
> I think this will make the design much more complete. We have no issues with the implementation and design of the other interfaces.

Thanks @LinkinStars for this suggestion! Agreed with the real-time approach. I'll follow these suggestions and implement it early this week.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
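
For illustration, the asynchronous real-time direction agreed on above could look roughly like the sketch below. It is not the code in this PR. Only the method names `UpdateContent` and `DeleteContent` come from the discussion; their signatures here, and the names `ContentSyncer`, `AsyncSyncer`, `Event`, and `memoryStore`, are assumptions made up for the example. A single worker goroutine drains a buffered channel, so the write path only pays for a non-blocking channel send, and events for the same ID are applied in order.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// ContentSyncer is a HYPOTHETICAL subset of a vector-search plugin.
// The method names come from the discussion; the signatures are assumed.
type ContentSyncer interface {
	UpdateContent(ctx context.Context, id, text string) error
	DeleteContent(ctx context.Context, id string) error
}

// EventKind distinguishes upserts from deletions.
type EventKind int

const (
	Upsert EventKind = iota
	Delete
)

// Event is what a create/update/delete/status-change hook would enqueue.
type Event struct {
	Kind EventKind
	ID   string
	Text string
}

// AsyncSyncer applies events to the vector store off the write path.
type AsyncSyncer struct {
	target ContentSyncer
	events chan Event
	wg     sync.WaitGroup
}

// NewAsyncSyncer starts one worker goroutine with a bounded queue.
func NewAsyncSyncer(target ContentSyncer, buffer int) *AsyncSyncer {
	s := &AsyncSyncer{target: target, events: make(chan Event, buffer)}
	s.wg.Add(1)
	go s.run()
	return s
}

// run drains the queue in FIFO order until the channel is closed.
func (s *AsyncSyncer) run() {
	defer s.wg.Done()
	ctx := context.Background()
	for ev := range s.events {
		var err error
		switch ev.Kind {
		case Upsert:
			err = s.target.UpdateContent(ctx, ev.ID, ev.Text)
		case Delete:
			err = s.target.DeleteContent(ctx, ev.ID)
		}
		if err != nil {
			// A failed event is only logged; the bulk sync is the recovery path.
			fmt.Printf("vector sync failed for %s: %v\n", ev.ID, err)
		}
	}
}

// Enqueue never blocks the caller. It returns false when the queue is full,
// in which case the event is dropped. Must not be called after Close.
func (s *AsyncSyncer) Enqueue(ev Event) bool {
	select {
	case s.events <- ev:
		return true
	default:
		return false
	}
}

// Close stops accepting events and waits for queued ones to be applied.
func (s *AsyncSyncer) Close() {
	close(s.events)
	s.wg.Wait()
}

// memoryStore is an in-memory stand-in for a real vector store.
type memoryStore struct {
	mu   sync.Mutex
	docs map[string]string
}

func newMemoryStore() *memoryStore { return &memoryStore{docs: map[string]string{}} }

func (m *memoryStore) UpdateContent(_ context.Context, id, text string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.docs[id] = text
	return nil
}

func (m *memoryStore) DeleteContent(_ context.Context, id string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.docs, id)
	return nil
}

func (m *memoryStore) Len() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.docs)
}

func main() {
	store := newMemoryStore()
	syncer := NewAsyncSyncer(store, 16)
	syncer.Enqueue(Event{Kind: Upsert, ID: "q1", Text: "How do I configure the plugin?"})
	syncer.Enqueue(Event{Kind: Upsert, ID: "a1", Text: "Open the admin panel."})
	syncer.Enqueue(Event{Kind: Delete, ID: "a1"})
	syncer.Close()
	fmt.Println("documents in store:", store.Len())
}
```

Dropping events when the queue is full keeps the write path fast but lets the vector store drift, which is one reason to keep the existing bulk sync as the rebuild mechanism. A real implementation would probably also want retries and batching of embedding calls.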
