Raghav Jindal has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23730 )

Change subject: IMPALA-14566: Add euclidean_distance and cosine_similarity 
functions for ARRAY<FLOAT>
......................................................................


Patch Set 9:

Changes made as suggested by the reviewer in Patch Set 9:
-> Replaced memcpy with a pointer dereference: tuple_ptr + slot_offset is cast 
to float* and dereferenced. This is simpler, fits on a single line, and should 
be faster.
-> Fixed an overflow in the squared-difference calculation by computing the 
difference in double, so that squaring it cannot overflow.
-> Removed the comment on the helper function declared in the private section 
and added an informational comment describing what the helper does.
-> Removed the general format explanation and added a comment noting that the 
functions are for semantic search.
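
The overflow fix in the list above can be illustrated with a minimal 
standalone sketch. This is not the code from the patch: the function names 
and the use of std::vector<float> are assumptions made for illustration (the 
real functions take CollectionVal arguments). It only shows why widening to 
double before subtracting and squaring matters.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the Impala implementation: Euclidean distance
// between two equal-length float vectors, accumulated in double.
double EuclideanDistanceSketch(const std::vector<float>& a,
                               const std::vector<float>& b) {
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    // Widen to double before subtracting so that neither the difference
    // nor its square can exceed the float range.
    double diff = static_cast<double>(a[i]) - static_cast<double>(b[i]);
    sum += diff * diff;
  }
  return std::sqrt(sum);
}

// The same computation done entirely in float, shown only for contrast.
float EuclideanDistanceFloatOnly(const std::vector<float>& a,
                                 const std::vector<float>& b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    float diff = a[i] - b[i];
    sum += diff * diff;  // Becomes +inf once diff * diff exceeds ~3.4e38.
  }
  return std::sqrt(sum);
}
```

With inputs such as {1e30f} and {0.0f}, the float-only version squares 1e30 
and overflows, while the double version stays finite and returns roughly 1e30.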

I manually built Impala in a Docker container running on one of my Ubuntu VMs. 
When I added my code and tried to test it, I saw a series of issues in 
multiple files and fixed them one by one. Issues seen and fixes added:

1) catalogd failed to start, with:
RuntimeError: Expected 3 impalad(s), only 0 found
catalogd failed to start.
When I checked the log file, I saw a "Could not find symbol 
'_ZN6impala15VectorFunctions17EuclideanDistanceEPN10impala_udf15FunctionContextERKNS1_13CollectionValES6_'"
error. Other ExprsIr symbols (e.g., MathFunctions) were present, but the 
VectorFunctions symbols were missing. Fix: added the include and calls to 
InitBuiltinsDummy() in scalar-expr-evaluator.cc.
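
To read a mangled symbol like the one in the error above, it can be run 
through the demangler that ships with the GCC/Clang C++ runtime. The sketch 
below is a standalone helper and not part of the patch. The Demangle name is 
made up for illustration, and it assumes a toolchain that provides <cxxabi.h>.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Demangles an Itanium C++ ABI symbol name using the GCC/Clang runtime.
// Returns an empty string if the name cannot be demangled.
std::string Demangle(const char* mangled) {
  int status = 0;
  // __cxa_demangle allocates the result with malloc; the caller frees it.
  char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) return std::string();
  std::string result(out);
  std::free(out);
  return result;
}
```

Passing the symbol from the log should yield a signature for 
impala::VectorFunctions::EuclideanDistance taking a FunctionContext pointer 
and two const CollectionVal references, which is the builtin that could not 
be found.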

2) After the above change and a rebuild (cmake), I was able to start the 
Impala cluster, including catalogd, and verified by manually entering the 
impala shell that the euclidean function was registered. However, the select 
query failed:

[localhost:21050] default> SELECT euclidean_distance(vec1, 
vec2) FROM test_vectors_simple;
Query: SELECT euclidean_distance(vec1, vec2) FROM test_vectors_simple
Query submitted at: 2025-12-13 14:46:01 (Coordinator: http://bf70f323bd8b:25000)
2025-12-13 14:46:02 [Exception]  ERROR: Query 044277bc307555a0:e19b0a4400000000 
failed:
AnalysisException: No matching function with signature: 
euclidean_distance(ARRAY<FLOAT>, ARRAY<FLOAT>).

ArrayType didn't override matchesType(), so function matching always failed 
for array parameters. Fix: added a matchesType() override to ArrayType.

3) The next query failed during codegen:

[localhost:21050] default> SELECT euclidean_distance(vec1, vec2) FROM 
test_vectors_parquet;
Query: SELECT euclidean_distance(vec1, vec2) FROM test_vectors_parquet
Query submitted at: 2025-12-13 15:20:59 (Coordinator: http://bf70f323bd8b:25000)
Query state can be monitored at: 
http://bf70f323bd8b:25000/query_plan?query_id=684b9f46758684ab:551cc03a00000000
2025-12-13 15:21:01 [Exception]  ERROR: Query 684b9f46758684ab:551cc03a00000000 
failed:
Builtin 'euclidean_distance' with symbol 
'_ZN6impala15VectorFunctions17EuclideanDistanceEPN10impala_udf15FunctionContextERKNS1_13CollectionValES6_'
 does not exist. Verify that all your impalads are the same version.
Fragment failed during codegen, fragment index: F01

The symbol wasn't found during codegen. Fix: added the vector-functions-ir.cc 
include to impala-ir.cc.

4) Connection reset error: the query below crashed and the shell got 
disconnected.

[localhost:21050] default> SELECT euclidean_distance(vec1, vec2) FROM 
test_vectors_parquet;
Query: SELECT euclidean_distance(vec1, vec2) FROM test_vectors_parquet
Query submitted at: 2025-12-13 14:58:13 (Coordinator: http://bf70f323bd8b:25000)
Query state can be monitored at: 
http://bf70f323bd8b:25000/query_plan?query_id=4b44d5a3833d8afc:8e00b52c00000000
2025-12-13 14:58:19 [Exception] type=<class 'ConnectionResetError'> in 
GetOperationStatus.  [Errno 104] Connection reset by peer
2025-12-13 14:58:19 [Exception] Socket error104 [Errno 104] Connection reset by 
peer

I saw the error below for anyval-util.h and anyval-util.cc in the Impala logs:

Store to codegen cache succeeded. CodeGen Cache Key 
hash_code=f77f2b8197a7acfb:f67765bed3b3095b
F20251213 15:26:43.001370 1265839 anyval-util.h:207] 
444942f08596faa1:03df497500000000] Check failed: false ARRAY

Fix: added changes to anyval-util.h, anyval-util.cc and udf.h to resolve the 
"Unknown type: ARRAY" error.
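
The "Check failed: false ARRAY" message is the kind of failure a type-dispatch 
switch produces when a type has no case of its own. The sketch below is a 
simplified illustration of that pattern, not the contents of anyval-util.h: 
the SketchType enum and the AnyValNameFor function are hypothetical, and only 
the value struct names (IntVal, FloatVal, StringVal, CollectionVal) are taken 
from Impala's UDF interface.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, simplified type tag; Impala's real type system is richer.
enum class SketchType { INT, FLOAT, STRING, ARRAY };

// Maps a type tag to the name of the UDF value struct that would carry it.
// Every supported type needs its own case; anything else is rejected.
std::string AnyValNameFor(SketchType t) {
  switch (t) {
    case SketchType::INT: return "IntVal";
    case SketchType::FLOAT: return "FloatVal";
    case SketchType::STRING: return "StringVal";
    // Without this case, ARRAY arguments would reach the failure below.
    case SketchType::ARRAY: return "CollectionVal";
  }
  throw std::logic_error("Unknown type");
}
```

In this sketch the failure is an exception rather than a fatal check, so an 
unhandled type can be observed without taking the process down.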

5) I faced a few challenges in adding the E2E test cases: the table format, 
"array" being a reserved keyword, and race conditions between parallel test 
instances overwriting the same table (one instance dropping the table while 
another was querying it). I created and populated the table through Hive and 
then ran the queries through the impala shell. I tested one case manually this 
way, then ran the e2e test cases in the vector-distance-functions.test file 
with the command ( ./tests/run-tests.py 
query_test/test_exprs.py::TestExprs::test_vector_distance_functions -v -s ) and 
verified in the output that the tests passed. The table format issue showed up 
as the error below.

Query: SELECT euclidean_distance(vec1, vec2) FROM test_vectors_simple
Query submitted at: 2025-12-13 14:54:17 (Coordinator: http://bf70f323bd8b:25000)
2025-12-13 14:54:19 [Exception]  ERROR: Query e84abec2f975ac5d:bd1e87a400000000 
failed:
NotImplementedException: Scan of table 'default.test_vectors_simple' in format 
'TEXT' is not supported because the table has a column 'vec1' with a complex 
type 'ARRAY<FLOAT>'.
Complex types are supported for these file formats: PARQUET, ORC, HUDI_PARQUET, 
PAIMON.

I see that the Impala Jenkins job is still failing, but I am unable to access 
the URL. I saw on Slack that this is a recent change, and I am checking with 
Laszlo to see whether I can get access to debug the cause of the Jenkins 
failure.


--
To view, visit http://gerrit.cloudera.org:8080/23730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id305acc87530d5d0e53613fe8df9a631ea4e1080
Gerrit-Change-Number: 23730
Gerrit-PatchSet: 9
Gerrit-Owner: Raghav Jindal <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Pranav Lodha <[email protected]>
Gerrit-Reviewer: Raghav Jindal <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Mon, 15 Dec 2025 19:14:33 +0000
Gerrit-HasComments: No
