Re: [PR] De-dup raw vectors? [lucene]

via GitHub Wed, 19 Nov 2025 20:20:28 -0800


kaivalnp commented on PR #15440:
URL: https://github.com/apache/lucene/pull/15440#issuecomment-3555694648


   ### File Layout
   
   Today, the `.vec` file is partitioned per-field, and looks like:
   
   ```
   # field 1 begin:  
   (vector for field 1, document d1) # position x0  
   (vector for field 1, document d2)  
   (vector for field 1, document d3)  
   # field 1 end, field 2 begin:  
   (vector for field 2, document d1) # position x1  
   (vector for field 2, document d3)  
   # field 2 end, field 3 begin:  
   (vector for field 3, document d1) # position x2  
   (vector for field 3, document d2)  
   # field 3 end, and so on...
   ```
   
   The `.vem` file contains one tuple per field, recording the `(position, length)` of that field's vector "block":
   
   ```
   # (field number, offset of vector "block", length of vector "block", ...)
   # "..." represents other metadata, including dimension, ord -> doc mapping, etc.
   (1, x0, x1 - x0, ...)  
   (2, x1, x2 - x1, ...)  
   # and so on...  
   ```
   
   The proposal is to partition the `.vec` file per-document instead, something like:
   
   ```
   # document d1 begin:  
   (vector for field 1, document d1) # position x0  
   (vector for field 2, document d1) # position x1  
   (vector for field 3, document d1) # position x2  
   # document d1 end, document d2 begin:
   (vector for field 1, document d2) # position x3  
   (vector for field 3, document d2) # position x4  
   # document d2 end, document d3 begin:
   (vector for field 1, document d3) # position x5  
   (vector for field 2, document d3) # position x6  
   # document d3 end, and so on...
   ```
   
   Correspondingly, the `.vem` file will contain per-field mappings of `ord -> position of vector` in the raw file:
   
   ```
   # (field number, ord -> position mapping as array, ...)
   # "..." represents other metadata, including dimension, ord -> doc mapping, etc. which is unchanged
   (1, [x0, x3, x5], ...) # {ord 0 -> position x0, ord 1 -> position x3, ord 2 -> position x5}
   (2, [x1, x6], ...) # {ord 0 -> position x1, ord 1 -> position x6}
   (3, [x2, x4], ...) # {ord 0 -> position x2, ord 1 -> position x4}
   # and so on...
   ```
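   
   As a rough illustration of how such a mapping could be built while writing vectors document by document, here is a minimal sketch. It assumes float32 vectors and fields written in ascending field-number order within each document; `PerDocumentLayoutSketch` and `buildPositions` are hypothetical names, not Lucene APIs:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;
   import java.util.TreeMap;
   
   // Hypothetical sketch, not the PR's implementation.
   final class PerDocumentLayoutSketch {
       /**
        * Each element of docs maps field number -> vector for one document.
        * Returns, per field, the byte position of each vector indexed by ord.
        */
       static Map<Integer, List<Long>> buildPositions(List<Map<Integer, float[]>> docs) {
           Map<Integer, List<Long>> positionsByField = new TreeMap<>();
           long position = 0; // next free byte in the raw vector file
           for (Map<Integer, float[]> doc : docs) {
               // Visit this document's fields in ascending field-number order.
               for (Map.Entry<Integer, float[]> e : new TreeMap<Integer, float[]>(doc).entrySet()) {
                   positionsByField
                       .computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add(position); // the list index is the ord within the field
                   position += (long) e.getValue().length * Float.BYTES;
               }
           }
           return positionsByField;
       }
   }
   ```
   
   For the three documents in the example above with 2-dimensional vectors (8 bytes each), this yields `[0, 24, 40]` for field 1, `[8, 48]` for field 2, and `[16, 32]` for field 3.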
   
   In case of duplicate vectors _within_ a document, the mapping entries for the affected fields can simply "point" to the same pre-existing vector, without writing another copy on disk!
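   
   One way this could work at write time is to remember the position of every distinct vector already written for the current document and reuse it on a repeat. The sketch below assumes float32 vectors and exact (bitwise) equality as the duplicate test; `WithinDocDedupSketch`, `Key`, and `assignPositions` are hypothetical names, not Lucene APIs:
   
   ```java
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.Map;
   
   // Hypothetical sketch, not the PR's implementation.
   final class WithinDocDedupSketch {
       /** Wraps a vector so it can serve as a content-based HashMap key. */
       private static final class Key {
           final float[] v;
           Key(float[] v) { this.v = v; }
           @Override public boolean equals(Object o) {
               return o instanceof Key && Arrays.equals(v, ((Key) o).v);
           }
           @Override public int hashCode() { return Arrays.hashCode(v); }
       }
   
       /**
        * docVectors holds one vector per field of a single document.
        * Returns the byte position assigned to each one; duplicates share a position.
        */
       static long[] assignPositions(float[][] docVectors, long startPosition) {
           Map<Key, Long> seen = new HashMap<>();
           long[] positions = new long[docVectors.length];
           long next = startPosition;
           for (int i = 0; i < docVectors.length; i++) {
               Key key = new Key(docVectors[i]);
               Long existing = seen.get(key);
               if (existing != null) {
                   positions[i] = existing; // point to the copy already on disk
               } else {
                   positions[i] = next;     // a real writer would emit the bytes here
                   seen.put(key, next);
                   next += (long) docVectors[i].length * Float.BYTES;
               }
           }
           return positions;
       }
   }
   ```
   
   For example, a document whose first and third fields hold the same 2-dimensional vector, written starting at position 100, gets positions `[100, 108, 100]`.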
   
   In the existing format, the offset of the vector at ordinal `ord` in field `f` is calculated by seeking to `ord * vectorByteSize` _inside_ the vector "block" of field `f`, so no per-vector position needs to be stored.
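   
   That arithmetic can be sketched as follows, assuming float32 vectors and taking `blockOffset` to be the per-field offset read from `.vem`; `PerFieldOffsetSketch` and `vectorOffset` are hypothetical names, not Lucene APIs:
   
   ```java
   // Hypothetical sketch of the per-field ("block") offset computation.
   final class PerFieldOffsetSketch {
       /** Absolute byte offset in the raw vector file of the vector at ordinal ord. */
       static long vectorOffset(long blockOffset, int ord, int dimension) {
           long vectorByteSize = (long) dimension * Float.BYTES; // assumes float32 components
           return blockOffset + (long) ord * vectorByteSize;
       }
   }
   ```
   
   This only works because every vector of a field is the same size and the field's vectors are stored contiguously.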
   
   With this change, we store an additional per-field `ord -> position of vector` mapping that "points" to the vector in the raw vector file; the same mapping is used to locate vectors during search.
   
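   A lookup through such a mapping might look like the sketch below. It assumes float32 vectors, an in-memory `ByteBuffer` standing in for the raw vector file, and the mapping held as a plain `long[]`; `OrdToPositionLookupSketch` and `readVector` are hypothetical names, not Lucene APIs:
   
   ```java
   import java.nio.ByteBuffer;
   
   // Hypothetical sketch of reading a vector via an ord -> position mapping.
   final class OrdToPositionLookupSketch {
       static float[] readVector(ByteBuffer rawVectors, long[] ordToPosition, int ord, int dimension) {
           // One extra indirection replaces the old ord * vectorByteSize arithmetic.
           int base = Math.toIntExact(ordToPosition[ord]);
           float[] vector = new float[dimension];
           for (int i = 0; i < dimension; i++) {
               vector[i] = rawVectors.getFloat(base + i * Float.BYTES); // absolute read
           }
           return vector;
       }
   }
   ```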


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

