[
https://issues.apache.org/jira/browse/SOLR-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824857#comment-17824857
]
Chris M. Hostetter commented on SOLR-17164:
-------------------------------------------
Sanjay: I REALLY like your patch, thank you for working on this.
Some feedback...
* By the time {{handle4ArgsVariant()}} is called, we already know what
{{arg1Str}} and {{arg2Str}} are...
** so we should call them by more descriptive names in that method: {{private
ValueSource handle4ArgsVariant(FunctionQParser fp, String vecEncStr, String
vecSimStr)}}
** Likewise for the argument names in {{handle2ArgsVariant()}}
* Passing {{constVec}} to {{handle2ArgsVariant()}} seems redundant and a tad
confusing?
** By the time we call {{handle2ArgsVariant()}} if {{constVec}} is true, then
{{arg2Str}} – aka {{field2Name}} – must be null
*** If {{arg2Str}} is null but, {{constVec}} was false, then we're in a weird
error situation
** So i think the method sig should be: {{private ValueSource
handle2ArgsVariant(FunctionQParser fp, String field1Name, String field2Name)}}
*** And if {{field2Name}} is null, then we know we should ask {{fp}} to parse
a constant vec
* The error messages thrown by {{requireVectorType}} should really include the
field _name_
** I would suggest refactoring the method to take in the {{SchemaField}} and
return the {{DenseVectorField}} – using the {{SchemaField.getName()}} for the
error if the type is wrong and the cast can't be done...
{code:java}
private DenseVectorField requireVectorType(SchemaField field) throws
SyntaxError {
FieldType fieldType = field.getType()
if ( fieldType instanceof DenseVectorField ) {
return (DenseVectorField) field.getType();
}
throw new SolrException(
BAD_REQUEST,
String.format(
Locale.ROOT,
"Type mismatch: Expected [%s], but found a different field type for
field: [%s]",
DenseVectorField.class.getSimpleName(), field.getName()));
}
{code}
* {{handle2ArgsVariant()}} should check that the vector encoding & similarity
from {{field1Type}} match those from {{field2Type}}
** and we should have a test that we get an error if someone tries to mix and
match
** Either that, or ... *_MAYBE_* leave the code alone, but the refguide should
make it clear that when two fields are specified, the encoding & similarity are
taken from the first field
*** But i haven't thought this idea through all the way ... not sure it's a
good idea to let people get different results for {{vectorSimilarity(fieldX,
fieldY)}} then they get from {{vectorSimilarity(fieldY, fieldX)}}
*** If folks want to mix and match encodings & similarity functions, then it
doesn't seem weird to force them to use the 4 arg version
* I like the mock test you've added, but the pattern repeated in each of the
test methods (with a {{new FunctionQParser}} wrapped around the output of
{{truncatePrefixFunctionQName}} is awkward and hard to make sense of at first.
** Other then the input string, the args to the {{FunctionQParser}}
constructor are all consistent across all test methods (and if any test needs
to tweak them they are already mocked) as is the (static)
{{VectorSimilaritySourceParser}}
** So why not refactor all of this logic into a helper method...
{code:java}
protected ValueSource parseWithMocks(final String input) throws SyntaxError {
FunctionQParser fqp = new
FunctionQParser(input.substring("vectorSimilarity(".length()), localParams,
params, request);
return vecSimilarity.parse(fqp)
}
{code}
* Your error test coverage of the 0 args, 1 arg, and 2 args (when either arg
aren't vectors) bad input situations seem good – can you please add some tests
of theerror message return when 3 args are used? .... Looking at the code I'm
honestly not sure what happens if...
** first arg is a vector field, second arg is a constant vec, but there is a
third useless arg
** first two args are an encoding & sim function, but only one vector argument
is provided (either field name or constant)
* Can you please add some tests where a constant vector is provided using a
request param reference
** ie: {{vectorSimilarity(fieldName,$vec_param)}}
** (I'm not sure off the top of my head if param refs are resolved by the time
we'll be making the {{peek()}} call)
* Can you please update {{QueryEqualityTest.testFuncKnnVector()}} to check
that given (consistent inputs) the queries returned by the 4 arg version are
equal to the queries returned by the 2 arg version
** There a variant of the {{assertFuncEquals()}} method that takes a
{{SolrQueryRequest}} as it's first argument which can be useful for checking
the request param references are working at the same time...
{code:java}
SolrQueryRequest req = req("someVec", "[1,2,3,4]");
try {
assertFuncEquals(req,
"vectorSimilarity(FLOAT32,COSINE,float_vec_field,[1,2,3,4])",
"vectorSimilarity(float_vec_field,[1,2,3,4])",
"vectorSimilarity(FLOAT32,COSINE,float_vec_field,$someVec)",
"vectorSimilarity(float_vec_field,$someVec)",
)
} finally {
req.close();
}
{code}
* In the ref-guide additions you made, you say {{The first argument must be a
Knn vector, and the second argument can be either a Knn vector or a constant
vector.}} but that's kind of missleading/confusing?
** what is "Knn vector" in this context?
** It should be clear that it requires a "DenseVectorField name"
> Add 2 arg variant of vectorSimilarity() function
> ------------------------------------------------
>
> Key: SOLR-17164
> URL: https://issues.apache.org/jira/browse/SOLR-17164
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Priority: Major
> Labels: vector-based-search
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Solr's current 4 argument
> {{vectorySimilarity(vectorEncoding,similarityFunction,vec1,vec2)}} function
> is really awkward to use for the (seemingly common) situation where you just
> want to know the similarity between a field and a constant vector, or
> (probably less common) between two fields of the same type.
> The first two (currently) mandatory arguments to {{vectorySimilarity()}}
> ({{{}vectorEncoding{}}} and {{{}similarityFunction{}}}) are already mandatory
> properties of {{{}DenseVectorField{}}}. IIUC the only reason these arguments
> are required is in the (seemingly uncommon?) situation where you might want
> to compute the similarity of two vector constants, so the function needs to
> know what {{vectorEncoding}} and {{similarityFunction}} to use.
>
> ----
>
> It would be really nice to support a simplified 2 argument variant of
> {{vectorySimilarity()}} such that:
> * the first argument must be the name of a {{DenseVectorField}} field
> * the second argument must be either:
> ** A vector constant
> *** in which case the {{vectorEncoding}} use to parse the constant is
> infered from the fieldType properties of the first argument
> ** Or the name of a second {{DenseVectorField}} field
> *** in which case the {{vectorEncoding}} and {{similarityFunction}} of the
> two fields must match
> * The ValueSource returned should be based on the configured
> {{vectorEncoding}} & {{similarityFunction}} of the field(s)
> Examples...
> {noformat}
> vectorySimilarity(title_float_vec_dim4, [1.0,2.0,3.0,4.0])
> ...or...
> vectorySimilarity(title_float_vec_dim4, body_float_vec_dim4)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]