jpountz commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r911190808

##########
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##########
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;

 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {

   /** Sole constructor */
   protected KnnVectorsWriter() {}

-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue)
+      throws IOException;

Review Comment:
   It's not great to need to pass the field on every value and require implementations to look up the right data structure on every doc. Should we add one more layer to the API to look more like this:

   ```
   KnnFieldVectorsWriter {
     addValue(int docID, float[] vectorValue);
   }

   KnnVectorsWriter {
     KnnFieldVectorsWriter addField(FieldInfo info);
     flush(int maxDoc);
     // merge(), etc.
   }
   ```

##########
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##########
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;

 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {

   /** Sole constructor */
   protected KnnVectorsWriter() {}

-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue)
+      throws IOException;
+
+  /** Flush all buffered data on disk * */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException;
+
+  /** Write field for merging */
+  public abstract void writeFieldForMerging(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)

Review Comment:
   Is it the same as `mergeXXXField` in `DocValuesConsumer` or `mergeOneField` in `PointsWriter`? Maybe we should rename to `mergeOneField` and make this method responsible for creating the merged view (instead of doing it on top)?

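For illustration only, the two-layer shape proposed in the first comment on `KnnVectorsWriter` could be sketched roughly as below. This is a standalone sketch, not the Lucene API: `SketchFieldInfo` is a hypothetical stand-in for `FieldInfo`, `InMemoryKnnVectorsWriter`, `BufferingFieldWriter` and `flushedCounts()` are made-up names, and `flush` only records how many vectors each field buffered instead of writing anything to disk. The point it tries to show is that the caller keeps the per-field handle returned by `addField`, so nothing is looked up by field on each `addValue` call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for Lucene's FieldInfo: only a name and a vector dimension. */
final class SketchFieldInfo {
  final String name;
  final int dimension;

  SketchFieldInfo(String name, int dimension) {
    this.name = name;
    this.dimension = dimension;
  }
}

/** Per-field layer: values are added through this handle, without passing the field again. */
interface KnnFieldVectorsWriter {
  void addValue(int docID, float[] vectorValue);
}

/** Top-level layer: creates one per-field writer per field and flushes all of them. */
interface KnnVectorsWriter {
  KnnFieldVectorsWriter addField(SketchFieldInfo info);

  void flush(int maxDoc);
}

/** Buffers doc IDs and vectors for a single field (assumed in-memory layout). */
final class BufferingFieldWriter implements KnnFieldVectorsWriter {
  final SketchFieldInfo info;
  final List<Integer> docIDs = new ArrayList<>();
  final List<float[]> vectors = new ArrayList<>();

  BufferingFieldWriter(SketchFieldInfo info) {
    this.info = info;
  }

  @Override
  public void addValue(int docID, float[] vectorValue) {
    if (vectorValue.length != info.dimension) {
      throw new IllegalArgumentException("expected dimension " + info.dimension);
    }
    docIDs.add(docID);
    vectors.add(vectorValue.clone()); // copy so the caller may reuse its array
  }
}

/** Toy implementation: the field lookup happens once in addField, never per document. */
final class InMemoryKnnVectorsWriter implements KnnVectorsWriter {
  private final Map<String, BufferingFieldWriter> fields = new LinkedHashMap<>();
  private final Map<String, Integer> flushedCounts = new LinkedHashMap<>();

  @Override
  public KnnFieldVectorsWriter addField(SketchFieldInfo info) {
    if (fields.containsKey(info.name)) {
      throw new IllegalArgumentException("field already added: " + info.name);
    }
    BufferingFieldWriter writer = new BufferingFieldWriter(info);
    fields.put(info.name, writer);
    return writer;
  }

  @Override
  public void flush(int maxDoc) {
    for (Map.Entry<String, BufferingFieldWriter> entry : fields.entrySet()) {
      BufferingFieldWriter writer = entry.getValue();
      for (int docID : writer.docIDs) {
        if (docID >= maxDoc) {
          throw new IllegalStateException("docID " + docID + " >= maxDoc " + maxDoc);
        }
      }
      // A real writer would write to disk here; the sketch only records a count.
      flushedCounts.put(entry.getKey(), writer.docIDs.size());
    }
    fields.clear();
  }

  /** Hypothetical accessor, present only so the sketch can be inspected. */
  Map<String, Integer> flushedCounts() {
    return flushedCounts;
  }
}

final class KnnWriterSketchDemo {
  public static void main(String[] args) {
    InMemoryKnnVectorsWriter writer = new InMemoryKnnVectorsWriter();
    KnnFieldVectorsWriter field = writer.addField(new SketchFieldInfo("title_vector", 2));
    field.addValue(0, new float[] {0.1f, 0.2f});
    field.addValue(3, new float[] {0.3f, 0.4f});
    writer.flush(4);
    System.out.println(writer.flushedCounts());
  }
}
```

With a shape like this, a per-field format wrapper would presumably delegate `addField` to the writer of the matching format and return that writer's handle, so a cached single-writer fast path would not be needed.
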
##########
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##########
@@ -94,17 +95,61 @@ public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException
   private class FieldsWriter extends KnnVectorsWriter {
     private final Map<KnnVectorsFormat, WriterAndSuffix> formats;
     private final Map<String, Integer> suffixes = new HashMap<>();
+    private final Map<KnnVectorsWriter, Collection<String>> writersForFields =
+        new IdentityHashMap<>();
     private final SegmentWriteState segmentWriteState;

+    // if there is a single writer, cache it for faster indexing
+    private KnnVectorsWriter singleWriter;

Review Comment:
   We should design the API in such a way that such tricks are not needed; I left a comment on `KnnVectorsWriter`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org