[
https://issues.apache.org/jira/browse/SOLR-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888717#comment-17888717
]
Matthias Krueger commented on SOLR-17487:
-----------------------------------------
I just tried to reproduce this locally using your 384 and 768 JSON vector
examples and they post fine in the admin UI and in a SolrTestCaseJ4. Can you
reproduce the issue outside the Solr admin UI?
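One way to run that check without the admin UI is to post the vector from a small script. The following is only a sketch under assumptions that are not in the issue: a local Solr at http://localhost:8983, a hypothetical core named "mycore", and a hypothetical field name "doc_vector_768" chosen to match the *_vector_768 dynamic field. The helper names are illustrative as well.

```python
import json
import urllib.request
from collections import Counter

# Assumptions (not from the issue): Solr base URL, core name, and field name.
SOLR_URL = "http://localhost:8983/solr"
CORE = "mycore"            # hypothetical core name
FIELD = "doc_vector_768"   # hypothetical; matches the *_vector_768 dynamic field


def repeated_values(vector):
    """Return the values that occur more than once, mapped to their counts."""
    return {v: n for v, n in Counter(vector).items() if n > 1}


def build_update_request(doc_id, vector, field=FIELD):
    """Build (but do not send) a JSON update request for a single document."""
    body = json.dumps([{"id": doc_id, field: vector}]).encode("utf-8")
    return urllib.request.Request(
        f"{SOLR_URL}/{CORE}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_vector(doc_id, vector):
    """Send the document to Solr and return the raw response body."""
    with urllib.request.urlopen(build_update_request(doc_id, vector)) as resp:
        return resp.read().decode("utf-8")
```

Calling repeated_values on the attached vector before posting shows whether it contains the repeated floats described in the report. If Solr rejects the document, urlopen raises urllib.error.HTTPError, and the error body should carry the dimension message.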
> Can't POST a dense vector that contains two or more occurrences of the same
> float value
> --------------------------------------------------------------------------------------
>
> Key: SOLR-17487
> URL: https://issues.apache.org/jira/browse/SOLR-17487
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: UpdateRequestProcessors
> Affects Versions: 9.7, 9.6.1
> Reporter: Guillaume Jactat
> Priority: Major
> Attachments: image-2024-10-10-18-05-01-195.png,
> image-2024-10-10-18-07-14-904.png, image-2024-10-10-18-07-19-370.png,
> image-2024-10-10-23-27-26-566.png, vector-384.json, vector-384.xml,
> vector-768.json
>
>
> *EDIT 10/10/2024* :
> After a detailed analysis of the problematic vectors, I found that the
> “missing” dimensions were actually dimensions of the same value.
> In concrete terms, the values present several times in the posted vectors are
> deduplicated by Solr.
> You can see for yourself that the vectors supplied as attachments have the
> common characteristic of containing {*}two or more occurrences of the very
> same float value{*}. The embedding model I use (all-minilm:33m) seems to
> generate many such cases.
> It seems that {*}Solr only takes into account the first occurrence of these
> values{*}. As a result, the length of the final vector is no longer correct.
> The following screenshot shows exactly what happens with a smaller vector
> field type of size 5: the vector [1, 5, 3, 4, 5] becomes [1, 5, 3, 4].
> !image-2024-10-10-23-27-26-566.png!
>
> ---------------------------------------------
> Hello,
>
> I'm using Solr 9.7 as a vector database. I've come across something I can't
> explain : I POST my documents as JSON and I've got a vector field of
> dimension {*}768{*}.
>
> The JSON document I POST has a vector field, which is an array of length 768.
> Each value is a float.
>
> Solr complains that my array is only *767* long...
> I've compared the JSON I POST with the array parsed by Solr and written to the
> logs, and indeed, one of the 768 values has simply disappeared in the
> process.
>
> The problem can easily be reproduced. All you have to do is :
> * In your schema.xml, declare the following dense vector field type:
> {code:xml}
> <fieldType name="knn_vector_768" class="solr.DenseVectorField"
> vectorDimension="768" similarityFunction="cosine"/>{code}
> * In your schema.xml, declare the following dense vector dynamic field:
> {code:xml}
> <dynamicField name="*_vector_768" type="knn_vector_768" indexed="true"
> stored="true"/>{code}
> * Use the Solr Admin UI to post the *attached document* to your Solr core.
> * You should get the following error: {*}"incorrect vector dimension. The
> vector value has size 767 while it is expected a vector with size 768"{*}
>
> * Furthermore, while the POSTed vector has 768 values, the vector written to
> the logs has only 767: one value is missing. You can easily spot the missing
> value with a simple diff.
> Maybe someone will find the reason why this specific vector leads to this
> issue. Of course, I have plenty of other documents that get indexed without
> any issue.
> In case it helps, the value that disappears from the 768 vector is
> "0.0335415453". It's the 384th dimension (starting from 1).
> !image-2024-10-10-18-07-19-370.png!
> Thanks for reading