hgaol commented on PR #1510:
URL: https://github.com/apache/answer/pull/1510#issuecomment-4277236383

   > > There are also some follow-ups I don't have a clear answer for yet. Posting 
them here for discussion.
   > > ## Follow-up: When to calc the embeddings and sync to vector storage
   > > ### Comparison with Search Plugin
   > > | Aspect | Search Plugin | VectorSearch Plugin |
   > > | --- | --- | --- |
   > > | Bulk sync | Yes | Yes |
   > > | Real-time sync | Yes (create/update/delete hooks) | No |
   > > | Trigger | Event-driven + startup/config | Startup/config only |
   > > | Consistency | Near real-time | Eventually consistent |
   > > ### Current Gap
   > > `UpdateContent()` / `DeleteContent()` exist in `plugin.VectorSearch`, 
but are not called from question/answer service events. So after the initial 
sync, content changes are not reflected until the next full re-sync. It's worth 
keeping them, since we'll need them once the syncing solution is finalized.
   > > ### Options
   > > Below are 3 options I have in mind. We could also add a setting to each 
vector plugin so that all 3 options are supported.
   > > 
   > > 1. **Manual (current)**
   > >    
   > >    * Re-sync only on plugin config save/update
   > >    * Simple, but stale results between syncs
   > > 2. **Real-time**
   > >    
   > >    * Add event hooks to call vector search update/delete
   > >    * Can be async (goroutine / queue) to avoid write-path latency
   > >    * Higher embedding API call volume
   > > 3. **Scheduled (cron)**
   > >    
   > >    * Periodic bulk sync via cron expression
   > >    * Good for off-peak syncing
   > >    * Delayed freshness until next run
   > 
   > @hgaol This is a great job—thanks again.
   > 
   > I think your analysis makes sense. We should go with the real-time 
approach.
   > 
   > In my view, semantic search is much more sensitive to stale data than 
normal keyword search. If we only rely on manual or scheduled full sync, the 
vector store can easily drift from the source of truth after 
question/answer/comment edits, status changes, or deletions.
   > 
   > So I think the better direction is:
   > 
   > 1. keep the existing bulk sync as the bootstrap/rebuild mechanism
   > 2. add real-time hooks on create/update/delete/status-change events
   > 3. perform vector store updates asynchronously, so we don’t add latency to 
the main write path
   > 
   > This way we get both:
   > 
   > * full sync for initialization/recovery
   > * near real-time consistency for day-to-day usage
   > 
   > **In short, I haven't seen you use the `DeleteContent` and `UpdateContent` 
methods.**
   > 
   > I think this will make the design much more complete. We have no issues 
with the implementation and design of the other interfaces.
   
   Thanks @LinkinStars for this suggestion! Agreed with the real-time approach. 
I'll follow these suggestions and implement it early this week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

