From Ali Alsuliman <[email protected]>:

Attention is currently required from: Calvin Thomas Dani, Ian Maxon.

Ali Alsuliman has posted comments on this change by Calvin Thomas Dani. ( 
https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126?usp=email )

Change subject: [ASTERIXDB-3676] Vector Distance KNN Function
......................................................................


Patch Set 19:

(17 comments)

File 
asterixdb/asterix-lang-common/src/main/java/org/apache/asterix/lang/common/util/CommonFunctionMapUtil.java:

https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/969b42c3_461e403d?usp=email
 :
PS19, Line 174: addFunctionMapping("euclidean_dist", "euclidean-dist");
              :         addFunctionMapping("manhattan_dist", "manhattan-dist");
              :         addFunctionMapping("dot_dist", "dot-dist");
              :         addFunctionMapping("cosine_dist", "cosine-dist");
I don't think you need to add the ones having only '-' and '_' differences. 
They should be handled by some other code. Check and let me know.


File 
asterixdb/asterix-om/src/main/java/org/apache/asterix/om/types/ATypeTag.java:

https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/ce9f9f2b_84339f97?usp=email
 :
PS19, Line 138: public boolean isNumericType() {
You can use ATypeHierarchy.getTypeDomain) == ATypeHierarchy.Domain.NUMERIC


File 
asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/vector/CosineDistanceArrDescriptor.java:

https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/a38a9267_8470a031?usp=email
 :
PS19, Line 37: DescriptorFactoryUtil
             :             .createFactory(CosineDistanceArrDescriptor::new, 
FunctionTypeInferers.SET_ARGUMENTS_TYPE);
Typically the descriptors that implement type inferer also implement `public 
void setImmutableStates` to capture the arguments types and then pass them to 
the function evaluator. I don't see you need the arguments types. If not, then 
you can just simplify the descriptor factory as:
```
public final static IFunctionDescriptorFactory FACTORY = 
CosineDistanceArrDescriptor:new;
```


File 
asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/evaluators/functions/vector/VectorDistanceArrScalarEvaluator.java:

https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/807501e7_af9904bb?usp=email
 :
PS19, Line 63: VectorDistanceArrScalarEvaluator
For all the instance variables, start by making them `private final` and then 
open up when needed. Also organize a bit so that they are next to each and 
interface declarations are next to each other.


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/5da6263f_a876ce49?usp=email
 :
PS19, Line 91: listAccessorConstant
I don't think you need two separate listAccessor & listAccessorConstant. You 
should be able to do it using only listAccessor where you create list accessors 
beforehand and then just re-use those objects.

Another note is you can also just simplify it by creating two variables instead 
of an array. This note is applicable to all the other array[2] variables and 
then you get rid of the for-loops and instead just refactor the code so that 
you can call it for arg1 and arg2. Give this a try and see how it looks. The 
goal is to make the code simpler.


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/ec309be2_94ce4175?usp=email
 :
PS19, Line 113: if 
(!ATYPETAGDESERIALIZER.deserialize(pointables[i].getByteArray()[pointables[i].getStartOffset()])
              :                         .isListType())
Does this mean it is going to throw an exception if you pass a NULL/MISSING 
constant arg? e.g. `euclidean(arg1, NULL);`


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/d379f6fd_01089963?usp=email
 :
PS19, Line 140: listAccessor[i] = new ListAccessor();
As I said, for variable arg, we should not create this list accessor each time 
we evaluate a tuple. You should be able to re-use the two list accessors that 
you created beforehand and just call `listAccessor.reset()`.


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/76b9a80c_2d6e203c?usp=email
 :
PS19, Line 160: PointableHelper.setNull(result);
We typically return MISSING if there is any arg MISSING, then NULL if there is 
any arg NULL which is taken care of by `checkAndSetMissingOrNull()`. Are you 
purposefully returning always NULL?


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/ff30fc10_ad09dd99?usp=email
 :
PS19, Line 174:             double[] primitiveArrayLeft =
              :                     isConstant[0] ? primitiveArrayConstant[0] : 
createArrayFromList(leftListAccessor, vectorArrays[0]);
              :             double[] primitiveArrayRight =
              :                     isConstant[1] ? primitiveArrayConstant[1] : 
createArrayFromList(rightListAccessor, vectorArrays[1]);
We need to handle the less common case where each evaluated tuple has different 
arguments length than the previous tuple, e.g.:
```
tuple1: arg1=[1,2,3], arg2=[1,2,3]
tuple2: arg1=[1,2,3,4], arg2=[1,2,3,4]
```
I believe this is less common, right? If yes, then we can handle it simply by 
resetting the double[] variables to a new double[] with the new length.


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/18dcd68e_111e5a0c?usp=email
 :
PS19, Line 190: result.set(resultStorage);
You can move this `result.set(resultStorage);` up inside the try-block after 
`writeResult(distanceCal, dataOutput);`


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/7ad92839_cea3ac4a?usp=email
 :
PS19, Line 194: AMutableDouble aDouble = new AMutableDouble(-1);
Same idea. You should be able to create this object only once beforehand and 
re-use it here instead of creating one for each evaluated tuple in the 
`evaluate()`


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/d25f04f8_c02d9516?usp=email
 :
PS19, Line 203: IPointable tempVal = new VoidPointable();
              :         ArrayBackedValueStorage storage = new 
ArrayBackedValueStorage();
Same idea. You should be able to create those objects only once beforehand and 
re-use them here instead of creating them for each evaluated tuple in the 
`evaluate()`


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/28cb07a1_d2ae51b8?usp=email
 :
PS19, Line 212: extractNumericVector
I believe that we can simplify it to something like:
```
byte[] data = pointable.getByteArray();
int offset = pointable.getStartOffset();
return switch (data[offset]) {
            case SERIALIZED_INT8_TYPE_TAG -> 
AInt8SerializerDeserializer.getByte(data, offset + 1);
            case SERIALIZED_INT16_TYPE_TAG -> 
AInt16SerializerDeserializer.getShort(data, offset + 1);
            case SERIALIZED_INT32_TYPE_TAG -> 
AInt32SerializerDeserializer.getInt(data, offset + 1);
            case SERIALIZED_INT64_TYPE_TAG -> 
AInt64SerializerDeserializer.getLong(data, offset + 1);
            case SERIALIZED_FLOAT_TYPE_TAG -> 
AFloatSerializerDeserializer.getFloat(data, offset + 1);
            case SERIALIZED_DOUBLE_TYPE_TAG -> 
ADoubleSerializerDeserializer.getDouble(data, offset + 1);
            default -> Double.NaN;
};
```


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/a3b8e231_5d38eb64?usp=email
 :
PS19, Line 221: DISTANCE_FN_MAP.get(func)
Why not use the funcId directly and remove this map?


File 
asterixdb/asterix-runtime/src/main/java/org/apache/asterix/runtime/utils/VectorDistanceArrCalculation.java:

https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/674521ea_4a01c89a?usp=email
 :
PS19, Line 28: sum += diff * diff;
Is there a possibility that this could overflow/underflow? The same question 
for the other methods here.


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/464cd9ba_0f919c8b?usp=email
 :
PS19, Line 30: if (Double.isNaN(sum)) {
This check is redundant. Math.sqrt(sum) already returns NaN if sum is NaN.


https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126/comment/c2e0140b_970d50b4?usp=email
 :
PS19, Line 37: double sum = 0.0;
             :         for (int i = 0; i < a.length; i++) {
             :             double diff = a[i] - b[i];
             :             sum += diff * diff;
             :         }
             :         if (Double.isNaN(sum)) {
             :             return Double.NaN; // Handle NaN case
             :         }
This part could be refactored with euclidean()



--
To view, visit https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/20126?usp=email
To unsubscribe, or for help writing mail filters, visit 
https://asterix-gerrit.ics.uci.edu/settings?usp=email

Gerrit-MessageType: comment
Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Change-Id: I45176e7da13ed222f48bb36bd1f0ab4074a0bcb9
Gerrit-Change-Number: 20126
Gerrit-PatchSet: 19
Gerrit-Owner: Calvin Thomas Dani <[email protected]>
Gerrit-Reviewer: Ali Alsuliman <[email protected]>
Gerrit-Reviewer: Anon. E. Moose #1000171
Gerrit-Reviewer: Ian Maxon <[email protected]>
Gerrit-Reviewer: Jenkins <[email protected]>
Gerrit-Attention: Calvin Thomas Dani <[email protected]>
Gerrit-Attention: Ian Maxon <[email protected]>
Gerrit-Comment-Date: Wed, 18 Feb 2026 13:41:34 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No

Reply via email to