[ 
https://issues.apache.org/jira/browse/SOLR-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078614#comment-18078614
 ] 

Sanjay Dutt commented on SOLR-18207:
------------------------------------

The proposed Lucene data-blind quantization feature
(https://github.com/apache/lucene/issues/16029) would allow codecs to drop raw
float vectors internally, saving storage at the cost of no longer being able to
re-quantize from the original input. Although this feature is not available
yet, I am trying to evaluate whether it is still valuable to add a Solr-side
option to disable storing raw vectors as StoredFields.

Today, such a Solr option can reduce duplicate storage because Lucene may still
preserve raw vectors internally, so Solr can read them back from there. With
data-blind quantization, that assumption may no longer hold. If Lucene drops
raw vectors internally and Solr also disables StoredField storage, then no raw
copy remains anywhere and raw vector retrieval from Solr would no longer be
available.

So the Solr feature may still be useful, but it probably needs to clearly 
define whether raw vectors are preserved somewhere, or whether retrieval is 
intentionally unsupported.
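
To make the four combinations concrete, here is a minimal sketch of the
decision as I understand it. Every name in it (VectorRetrievalSketch, Source,
resolve, isDuplicated, and the two boolean flags) is hypothetical and exists
only for illustration; none of it is Solr or Lucene API, and it assumes that a
codec using data-blind quantization keeps no raw float vectors at all.

```java
// Hypothetical model only: these names are NOT Solr or Lucene API.
final class VectorRetrievalSketch {

    /** Where a raw float vector could be read back from, if anywhere. */
    enum Source { STORED_FIELD, LUCENE_VECTOR_DATA, UNAVAILABLE }

    /**
     * Picks the place a raw vector would be fetched from.
     *
     * @param storedFieldCopy Solr writes the vector as a stored field
     * @param codecKeepsRaw   the Lucene codec keeps raw float vectors
     *                        (assumed false under data-blind quantization)
     */
    static Source resolve(boolean storedFieldCopy, boolean codecKeepsRaw) {
        if (storedFieldCopy) {
            return Source.STORED_FIELD;
        }
        if (codecKeepsRaw) {
            return Source.LUCENE_VECTOR_DATA;
        }
        // Neither side kept the raw floats, so nothing can be returned.
        return Source.UNAVAILABLE;
    }

    /** True when the same raw vector is kept in both places. */
    static boolean isDuplicated(boolean storedFieldCopy, boolean codecKeepsRaw) {
        return storedFieldCopy && codecKeepsRaw;
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false};
        for (boolean stored : flags) {
            for (boolean raw : flags) {
                System.out.printf(
                    "storedFieldCopy=%b codecKeepsRaw=%b -> source=%s duplicated=%b%n",
                    stored, raw, resolve(stored, raw), isDuplicated(stored, raw));
            }
        }
    }
}
```

Only the case where both flags are false ends in UNAVAILABLE, which is the
combination the Solr option would need to either prevent or document as
intentionally unsupported.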

> Add derived stored retrieval for DenseVectorField to avoid duplicate vector 
> storage
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-18207
>                 URL: https://issues.apache.org/jira/browse/SOLR-18207
>             Project: Solr
>          Issue Type: Task
>          Components: vector-search
>            Reporter: Sanjay Dutt
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Solr DenseVectorField currently stores vector data twice when stored="true": 
> once in Lucene’s vector index for kNN/search and again in stored fields for 
> retrieval. This increases index size significantly for large vector workloads.
> This change adds an opt-in mode for DenseVectorField that preserves 
> stored-field semantics for normal document retrieval while avoiding the 
> redundant stored-field copy of the vector payload. Instead, Solr reconstructs 
> the returned vector value from Lucene vector data at fetch time.
> Key points:
>  * Adds an opt-in field type/property for derived vector retrieval.
>  * Avoids writing redundant stored vector bytes at index time.
>  * Extends document fetch to materialize vector values from Lucene vector 
> readers.
>  * Keeps existing behavior unchanged unless the new option is enabled.
>  * Documents the fetch-time tradeoff and recommends caution for hot paths 
> that return vectors frequently, especially fl=*.
> {code:xml}
> <fieldType name="knn_vector_derived"
>            class="solr.DenseVectorField"
>            vectorDimension="1024"
>            similarityFunction="cosine"
>            knnAlgorithm="hnsw"
>            indexed="true"
>            useVectorValuesAsStored="true"/>{code}
> Initial scope:
>  * Single-valued vector fields only.
>  * Multivalued derived vector retrieval is not supported in this change.



