Hi Uwe
Thanks again for your feedback, I got it working now :-)
I am using a simplified version, which I am posting below in case it
helps others, provided this implementation makes sense.
Btw, when a new version of Lucene gets released, what is the best way to
find out whether "Lucene95Codec" is still the default codec or whether
there is a new default codec?
Thanks
Michael
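One possible way to check this at runtime, as a minimal sketch: ask Lucene itself via Codec.getDefault(), which returns the default codec of whichever Lucene version is on the classpath. The class name DefaultCodecCheck and the method defaultCodecName are made up for illustration, and the sketch assumes lucene-core is on the classpath.

```java
import org.apache.lucene.codecs.Codec;

// Hypothetical helper (not part of Lucene): reports the default codec
// of the Lucene version that is currently on the classpath.
class DefaultCodecCheck {

    static String defaultCodecName() {
        // Codec.getDefault() is Lucene's own accessor for the default codec.
        return Codec.getDefault().getName();
    }

    public static void main(String[] args) {
        System.out.println("Default codec: " + defaultCodecName());
    }
}
```

Running this once after an upgrade and comparing the printed name with the codec class being subclassed should show whether the default has moved on.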
---
@Autowired
private LuceneCodecFactory luceneCodecFactory;

IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(luceneCodecFactory.getCodec());
----
package com.erkigsnek.webapp.services;

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;
import org.springframework.stereotype.Component;
import lombok.extern.slf4j.Slf4j;

import java.io.IOException;

@Slf4j
@Component
public class LuceneCodecFactory {

    private final int maxDimensions = 16384;

    /**
     * Returns a Lucene95Codec whose vector format accepts up to
     * maxDimensions dimensions per vector field.
     */
    public Codec getCodec() {
        //return Lucene95Codec.getDefault();
        log.info("Get codec ...");
        Codec codec = new Lucene95Codec() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                var delegate = new Lucene95HnswVectorsFormat();
                log.info("Maximum Vector Dimension: " + maxDimensions);
                return new DelegatingKnnVectorsFormat(delegate, maxDimensions);
            }
        };
        return codec;
    }
}

/**
 * This class exists because Lucene95HnswVectorsFormat's getMaxDimensions
 * method is final and we need to work around that constraint to allow
 * more than the default number of dimensions.
 */
@Slf4j
class DelegatingKnnVectorsFormat extends KnnVectorsFormat {

    private final KnnVectorsFormat delegate;
    private final int maxDimensions;

    public DelegatingKnnVectorsFormat(KnnVectorsFormat delegate, int maxDimensions) {
        super(delegate.getName());
        this.delegate = delegate;
        this.maxDimensions = maxDimensions;
    }

    @Override
    public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state);
    }

    @Override
    public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state);
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        log.info("Maximum vector dimension: " + maxDimensions);
        return maxDimensions;
    }
}

On 19.10.23 at 11:23, Uwe Schindler wrote:
Hi Michael,
The max vector dimension limit is no longer checked in the field type,
as it is the responsibility of the codec to enforce it.
You need to build your own codec that returns a different setting so
it can be enforced by IndexWriter. See Apache Solr's code for how to wrap
the existing KnnVectorsFormat so it returns another limit:
<https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L159-L183>
Basically you need to subclass Lucene95Codec as is done here:
<https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L99-L146>
and return a different vectors format, such as a delegator as described
before.
The responsibility was shifted to the codec, because there may be
better alternatives to HNSW that have different limits especially with
regard to performance during merging and query response times, e.g.
BKD trees.
Uwe
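To picture the mechanism described above, here is a plain-Java model of "the format owns the limit, the writer asks for it". It is only a sketch: VectorFormat, HnswLikeFormat, DelegatingFormat and DimensionCheck are hypothetical stand-ins and not Lucene classes, and the limit of 1024 is taken from the error message quoted further down in this thread.

```java
// Plain-Java model of the idea only; none of these types are Lucene classes.
interface VectorFormat {
    String name();
    int getMaxDimensions(String fieldName);
}

// Stand-in for a built-in format with a fixed limit.
final class HnswLikeFormat implements VectorFormat {
    @Override
    public String name() {
        return "HnswLike";
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        return 1024;
    }
}

// Wrapper that keeps the delegate's name but reports its own limit.
final class DelegatingFormat implements VectorFormat {
    private final VectorFormat delegate;
    private final int maxDimensions;

    DelegatingFormat(VectorFormat delegate, int maxDimensions) {
        this.delegate = delegate;
        this.maxDimensions = maxDimensions;
    }

    @Override
    public String name() {
        return delegate.name();
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        return maxDimensions;
    }
}

final class DimensionCheck {
    // Stand-in for the check a writer would run before accepting a vector.
    static void checkDimensions(VectorFormat format, String field, float[] vector) {
        int max = format.getMaxDimensions(field);
        if (vector.length > max) {
            throw new IllegalArgumentException(
                "field " + field + ": vector has " + vector.length
                    + " dimensions, limit is " + max);
        }
    }

    public static void main(String[] args) {
        float[] embedding = new float[1536];
        VectorFormat raised = new DelegatingFormat(new HnswLikeFormat(), 16384);
        checkDimensions(raised, "vector", embedding); // accepted by the wrapper
        try {
            checkDimensions(new HnswLikeFormat(), "vector", embedding);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The wrapper changes nothing about how vectors are stored or read; it only answers the limit question differently, which is why a delegator is enough here.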
On 19.10.2023 at 10:53, Michael Wechner wrote:
I forgot to mention that using the custom FieldType with 1536
vector dimensions does work with Lucene 9.7.0
Thanks
Michael
On 19.10.23 at 10:39, Michael Wechner wrote:
Hi
I recently upgraded Lucene to 9.8.0 and was running tests with
OpenAI's embedding model, which has the vector dimension 1536 and
received the following error
Field [vector] vector's dimensions must be <= [1024]; got 1536
whereas this worked previously with the hack of overriding the vector
dimension limit using a custom FieldType
float[] vector = ...
FieldType vectorFieldType = new CustomVectorFieldType(vector.length,
    VectorSimilarityFunction.COSINE);
and setting
KnnFloatVectorField vectorField = new KnnFloatVectorField("VECTOR_FIELD",
    vector, vectorFieldType);
But this does not seem to work anymore with Lucene 9.8.0.
Is this hack now prevented by the Lucene code itself, or is there a way
to make this work again?
Whatever one thinks of OpenAI, the embedding model
"text-embedding-ada-002" is really good, and it is sad that one
cannot use it with Lucene because of the 1024 dimension restriction.
Thanks
Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org