benwtrent commented on issue #12497:
URL: https://github.com/apache/lucene/issues/12497#issuecomment-1677250336

   > have we considered choosing an initial quantization based on the first 
segment seen and using that for all subsequent segments?
   
   This may not be possible. What if the first segment is flushed after only 10 
docs? For this to work at all, the first segment would have to be a 
representative random sample of the data, and I am not sure how we could 
guarantee that.
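
   To make the concern concrete, here is a minimal sketch, not Lucene's 
implementation. It assumes a simple 8-bit scalar quantizer whose bounds are the 
1% and 99% quantiles of a sample, synthetic data, and a hypothetical early 
flush of 10 values that happen to be unrepresentative. All function names are 
made up for illustration.

```python
import random


def quantile_bounds(values, lower_q=0.01, upper_q=0.99):
    """Return (lo, hi) read off the sorted sample at the given quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[int(lower_q * (n - 1))], ordered[int(upper_q * (n - 1))]


def quantize(x, lo, hi, levels=256):
    """Clamp x to [lo, hi] and map it to an integer bucket in [0, levels - 1]."""
    clamped = min(max(x, lo), hi)
    return round((clamped - lo) / (hi - lo) * (levels - 1))


def dequantize(q, lo, hi, levels=256):
    """Map a bucket index back to a float inside [lo, hi]."""
    return lo + q / (levels - 1) * (hi - lo)


def mean_abs_error(values, lo, hi):
    """Average round-trip error when quantizing values with bounds (lo, hi)."""
    total = sum(abs(v - dequantize(quantize(v, lo, hi), lo, hi)) for v in values)
    return total / len(values)


rng = random.Random(42)
# Assumption: the full corpus has components spread over [-1, 1].
corpus = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
# Assumption: the first flush holds only 10 values from a narrow range.
first_flush = [rng.uniform(-0.1, 0.1) for _ in range(10)]

early_bounds = quantile_bounds(first_flush)
corpus_bounds = quantile_bounds(corpus)

early_err = mean_abs_error(corpus, *early_bounds)
corpus_err = mean_abs_error(corpus, *corpus_bounds)
print(f"bounds from first flush: {early_bounds}, error {early_err:.4f}")
print(f"bounds from full corpus: {corpus_bounds}, error {corpus_err:.4f}")
```

   Bounds frozen from the tiny flush clamp most of the later values, so the 
round-trip error on the full corpus is far larger than with bounds computed 
from a representative sample.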
   
   >  providing an API where quantization parameters can be provided?
   
   This assumes that the user already knows their vectors (or at least 
many of them) and has built the quantization well ahead of time. 
   
   Computing the quantization per segment allows the data to evolve over time 
without users having to worry about it.
   
   IMO, an API that accepts quantization parameters is no different from 
somebody just doing quantization outside of Lucene and indexing the byte 
values directly. 
   
   > These kinds of approaches would seem to offer increased benefits in that 
we would not need to store the original vectors (since re-quantization would 
never be required)
   
   I don't think these ideas make any such guarantee. We would simply not be 
doing any of the work for the user, and would be saying "y'all handle it, 
user": either by making sure they give us a nice representative random sample 
in the first segment, or by just quantizing outside of Lucene.
   
   > Looking at product-quantization schemes based on kmeans clustering this 
seems to be the usual approach
   
   Correct, using Lloyd's algorithm. As the dataset stabilizes with more data, 
Lloyd's algorithm could run more efficiently, since better initial centroids 
are known from previous runs. It lends itself to distributed work over time.
   
   But, I want to focus on scalar quantization first if possible :).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

