Hi Michael,

The version below looks correct. Of course, the Solr version can do much more; the code you posted limits it to the bare minimum:

 * subclass the default codec
 * implement getKnnVectorsFormatForField() and return the wrapper with
   a different maximum dimension

Reading indexes still works with the unmodified default codec; you only need to set the custom codec on IndexWriter. When reading, the actual codec is looked up by name.
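A small sketch of the name-based lookup described above (the method names are from Lucene's Codec class; the printed name depends on your Lucene version, so no specific output is assumed):

```java
import org.apache.lucene.codecs.Codec;

public class CodecNameLookup {
    public static void main(String[] args) {
        // The codec currently registered as default, e.g. Lucene95Codec in 9.5-9.8
        Codec def = Codec.getDefault();
        System.out.println("Default codec: " + def.getName());

        // At read time Lucene resolves the codec recorded in each segment
        // by name via SPI. A wrapper that passes delegate.getName() to
        // super() therefore stays readable with the stock codec:
        Codec byName = Codec.forName(def.getName());
        System.out.println("Resolved by name: " + byName.getName());
    }
}
```

This is also why DelegatingKnnVectorsFormat below calls super(delegate.getName()): the segment records the delegate's name, so reading works without the wrapper.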

Uwe

On 07.11.2023 at 17:03, Michael Wechner wrote:
Hi Uwe

Thanks again for your feedback, I got it working now :-)

I am using a simplified version, posted below, in the hope that it helps others, at least as long as this implementation makes sense.

Btw, when a new version of Lucene gets released, how do I best find out whether "Lucene95Codec" is still the most recent default codec or whether there is a new default codec?
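One way to check this programmatically (my own assumption, not confirmed in this thread, but Codec.getDefault() is part of Lucene's public API) is to log the default codec's name after an upgrade and compare it with the codec the factory subclasses:

```java
import org.apache.lucene.codecs.Codec;

public class DefaultCodecCheck {
    public static void main(String[] args) {
        // If this prints a name other than the codec our factory subclasses
        // (e.g. "Lucene95"), the default codec changed in the new Lucene
        // release and the subclass should be reviewed/updated.
        System.out.println("Current default codec: " + Codec.getDefault().getName());
    }
}
```

Alternatively, the release notes and the javadoc of org.apache.lucene.codecs list the per-version default codec.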

Thanks

Michael

---

@Autowired
private LuceneCodecFactory luceneCodecFactory;

IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(luceneCodecFactory.getCodec());

----

package com.erkigsnek.webapp.services;

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;
import org.springframework.stereotype.Component;
import lombok.extern.slf4j.Slf4j;

import java.io.IOException;

@Slf4j
@Component
public class LuceneCodecFactory {

    private final int maxDimensions = 16384;

    /**
     * Get default codec, but with a custom maximum vector dimension.
     */
    public Codec getCodec() {
        //return Lucene95Codec.getDefault();
        log.info("Get codec ...");
        Codec codec = new Lucene95Codec() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                var delegate = new Lucene95HnswVectorsFormat();
                log.info("Maximum vector dimension: " + maxDimensions);
                return new DelegatingKnnVectorsFormat(delegate, maxDimensions);
            }
        };

        return codec;
    }
}

/**
 * This class exists because Lucene95HnswVectorsFormat's getMaxDimensions method is final and we
 * need to work around that constraint to allow more than the default number of dimensions.
 */
@Slf4j
class DelegatingKnnVectorsFormat extends KnnVectorsFormat {

    private final KnnVectorsFormat delegate;
    private final int maxDimensions;

    public DelegatingKnnVectorsFormat(KnnVectorsFormat delegate, int maxDimensions) {
        super(delegate.getName());
        this.delegate = delegate;
        this.maxDimensions = maxDimensions;
    }

    @Override
    public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state);
    }

    @Override
    public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state);
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        log.info("Maximum vector dimension: " + maxDimensions);
        return maxDimensions;
    }
}






On 19.10.23 at 11:23, Uwe Schindler wrote:
Hi Michael,

The max vector dimension limit is no longer checked in the field type, as it is the responsibility of the codec to enforce it.

You need to build your own codec that returns a different setting so it can be enforced by IndexWriter. See Apache Solr's code for how to wrap the existing KnnVectorsFormat so it returns another limit: <https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L159-L183>

Basically you need to subclass Lucene95Codec as done here: <https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L99-L146> and return a different vectors format, like a delegator as described before.

The responsibility was shifted to the codec because there may be better alternatives to HNSW that have different limits, especially with regard to performance during merging and query response times, e.g. BKD trees.

Uwe

On 19.10.2023 at 10:53, Michael Wechner wrote:
I forgot to mention that using the custom FieldType with vector dimension 1536 does work with Lucene 9.7.0.

Thanks

Michael



On 19.10.23 at 10:39, Michael Wechner wrote:
Hi

I recently upgraded Lucene to 9.8.0 and was running tests with OpenAI's embedding model, which has vector dimension 1536, and received the following error:

Field[vector]vector's dimensions must be <= [1024]; got 1536

whereas this worked previously with the hack of overriding the vector dimension using a custom FieldType:

float[] vector = ...
FieldType vectorFieldType = new CustomVectorFieldType(vector.length, VectorSimilarityFunction.COSINE);

and setting

KnnFloatVectorField vectorField = new KnnFloatVectorField("VECTOR_FIELD", vector, vectorFieldType);

But this does not seem to work anymore with Lucene 9.8.0.

Is this hack now prevented by the Lucene code itself, or does anyone have an idea how to make this work again?

Whatever one thinks of OpenAI, the embedding model "text-embedding-ada-002" is really good, and it is sad that one cannot use it with Lucene because of the 1024-dimension restriction.

Thanks

Michael



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
