[GitHub] [lucene] jpountz commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

GitBox Thu, 30 Jun 2022 09:18:41 -0700


jpountz commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r911190808



##########
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##########
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 
 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {
 
   /** Sole constructor */
   protected KnnVectorsWriter() {}
 
-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] 
vectorValue)
+      throws IOException;

Review Comment:
   It's not great to need to pass the field on every value and require 
implementations to look up the right data structure on every doc. Should we add 
one more layer to the API to look more like this:
   
   ```
   KnnFieldVectorsWriter {
     addValue(int docID, float[] vectorValue);
   }
   
   KnnVectorsWriter {
     KnnFieldVectorsWriter addField(FieldInfo info);
     flush(int maxDoc);
     // merge(), etc.
   }
   ```



##########
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##########
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 
 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {
 
   /** Sole constructor */
   protected KnnVectorsWriter() {}
 
-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] 
vectorValue)
+      throws IOException;
+
+  /** Flush all buffered data on disk * */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws 
IOException;
+
+  /** Write field for merging */
+  public abstract void writeFieldForMerging(FieldInfo fieldInfo, 
KnnVectorsReader knnVectorsReader)

Review Comment:
   Is it the same as `mergeXXXField` in `DocValuesConsumer` or `mergeOneField` 
in `PointsWriter`? Maybe we should rename to `mergeOneField` and make this 
method responsible for creating the merged view (instead of doing it on top)?



##########
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##########
@@ -94,17 +95,61 @@ public KnnVectorsReader fieldsReader(SegmentReadState 
state) throws IOException
   private class FieldsWriter extends KnnVectorsWriter {
     private final Map<KnnVectorsFormat, WriterAndSuffix> formats;
     private final Map<String, Integer> suffixes = new HashMap<>();
+    private final Map<KnnVectorsWriter, Collection<String>> writersForFields =
+        new IdentityHashMap<>();
     private final SegmentWriteState segmentWriteState;
 
+    // if there is a single writer, cache it for faster indexing
+    private KnnVectorsWriter singleWriter;

Review Comment:
   We should design the API in such a way that such tricks are not needed, I 
left a commen on `KnnVectorsWriter`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

Reply via email to