Re: [I] Indexing "invalid" HNSW vectors does not trigger error until merge and/or CheckIndex [lucene]

via GitHub Thu, 29 Jan 2026 14:13:54 -0800


hossman commented on issue #15540:
URL: https://github.com/apache/lucene/issues/15540#issuecomment-3820706190


   > Yeah +1. It's kinda the vector equivalent of the empty string term, which
   > is indeed a valid term that you can index in Lucene if your tokenizer
   > produces it.
   
   Except that having an empty string term doesn't trip any assertions during
   segment merge or when running `CheckIndex`.
   
   That, to me, seems like the biggest problem here.
   
   One of two things needs to be true:
   
   1. The index is "valid", **_in spite of_** the vector value+similarity
      combo being invalid, and neither segment merging nor `CheckIndex` should
      care.
        - Even if this document will never (or always) be found via a vector
          search query, that is none of the business of the index/merge/check
          logic.
        - Analogous to indexing an empty string term.
   2. The index is "corrupt" **_because_** the vector value+similarity combo
      is invalid, and some code path should have stopped this document from
      ever being added to the index in the first place.
        - Either `new KnnByteVectorField(...)` or
          `IndexWriter.addDocument(...)` should have thrown an exception.
        - Analogous to trying to index a negative position increment in a token
          stream.
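
To make the second option concrete, here is a minimal sketch of the kind of up-front check it implies. It is only an illustration. `VectorPreflight` and its methods are hypothetical names, not Lucene APIs. It also assumes one particular "invalid" combination, an all-zero byte vector used with cosine similarity, where the similarity is mathematically undefined because the vector has zero magnitude. The comment above does not say which combination is at issue, so treat that choice as an assumption.

```java
import java.util.Objects;

/**
 * Hypothetical pre-flight check, NOT part of Lucene.
 * Assumption: the "invalid" case is a zero-magnitude byte vector paired with
 * cosine similarity, for which the cosine is undefined.
 */
final class VectorPreflight {

  private VectorPreflight() {}

  /** Squared L2 norm of a byte vector, accumulated in a long to avoid overflow. */
  static long squaredNorm(byte[] vector) {
    long sum = 0L;
    for (byte b : vector) {
      sum += (long) b * b;
    }
    return sum;
  }

  /**
   * Returns the vector unchanged if cosine similarity is well defined for it,
   * otherwise throws, so the failure happens at indexing time rather than
   * later during a merge or an index check.
   */
  static byte[] requireUsableForCosine(byte[] vector) {
    Objects.requireNonNull(vector, "vector");
    if (vector.length == 0) {
      throw new IllegalArgumentException("vector must have at least one dimension");
    }
    if (squaredNorm(vector) == 0L) {
      throw new IllegalArgumentException(
          "zero-magnitude vector cannot be used with cosine similarity");
    }
    return vector;
  }
}
```

An application could call `VectorPreflight.requireUsableForCosine(bytes)` just before building its `KnnByteVectorField`. Whether an equivalent check belongs inside the field constructor or inside `IndexWriter.addDocument(...)` is the open question raised above, and this sketch does not settle it.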
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

