hossman opened a new issue, #13105:
URL: https://github.com/apache/lucene/issues/13105

   ### Description
   
   Like most field types in lucene, `Knn(Byte|Float)VectorField` classes do not 
require a field value for every document -- they support use cases where some 
documents are indexed w/o a field value.
   
   The `(Byte|Float)KnnVectorFieldSource` implementations _generally_ follow 
the pattern of most field based `ValueSource` impls in terms of how they behave 
when some documents do not have a value in the specified field 
(`TestKnnVectorSimilarityFunctions` even checks for this in various tests using 
the `knnFloatField3` and `knnByteField3` fields which only exist in one 
document).  
   
   **_But_** these value sources break if/when a user gets unlucky and a 
segment exists where _none_ of the documents have value in the vector field 
(and by extension: when querying an empty index)
   
   This is due to the following check...
   
   ```
        final ByteVectorValues vectorValues = 
readerContext.reader().getByteVectorValues(fieldName);
   
        if (vectorValues == null) {
          throw new IllegalArgumentException(
              "no byte vector value is indexed for field '" + fieldName + "'");
   ...
   ```
   
   This problem is trivial to reproduce reliably by adding an `iw.commit();` 
call to `TestKnnVectorSimilarityFunctions` between indexing the two documents.
     
   (The `RandomIndexWriter` should eventually trigger it in this test even w/o 
modifications .. not sure if it ever has in jenkins?)
   
   ----
   
   The appropriate way to implement this type of check is to follow the pattern 
in the `DocValues.getFoo()` methods, that use a `checkField(...)` method to 
inspect the `FieldInfos` from the `LeafReader` for consistency w/expectations 
if `reader.getFooDocValues()` returns null.  `checkField(...)` assumes 
everything is fine if no `FieldInfo` exist for this field in this reader -- 
using an "empty" impl in it's place.
   
   ----
   
   I'm attaching a patch that fixes these `ValueSource` classes, by returning a 
`VectorFieldFunction` that wraps `DocIdSetIterator.empty()` when `vectorValues` 
is null, as long as a new `VectorEncoding` method `checkField(reader, 
fieldName)` does not throw an exception (Adding `checkField` directly to 
`VectorEncoding` seemed like the most straightforward place to put this ... not 
sure if there is a more appropriate class?) 
   
   I also made the new `checkField` method throw `IllegalStateException` 
instead of `IllegalArgumentException` to be consistent with other "is the index 
field type consistent with the type of query you are trying to do?" checks in 
the code base (like `DocValues.checkField(...)` and updated 
`TestKnnVectorSimilarityFunctions` accordingly.
   
   Patch: 
[VectorFieldSource.fix-missing-check.patch.txt](https://github.com/apache/lucene/files/14284577/VectorFieldSource.fix-missing-check.patch.txt)
   
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to