Raghav Jindal has posted comments on this change. ( http://gerrit.cloudera.org:8080/23730 )
Change subject: IMPALA-14566: Add euclidean_distance and cosine_similarity functions for ARRAY<FLOAT>
......................................................................

Patch Set 9:

Changes made in Patch Set 9, as suggested by the reviewer:

-> Replaced memcpy with a pointer dereference: tuple_ptr + slot_offset is cast to float* and dereferenced. This is simpler, fits on a single line, and should be faster.
-> Fixed the overflow in the squared-difference calculation by computing diff in double, which avoids overflow when squaring.
-> Removed the comment on the helper function declared in the private section and added an informational comment describing what the helper does.
-> Removed the general format explanation and added a comment specifying that the functions are for semantic search.

I built Impala manually in a Docker container running on one of my Ubuntu VMs. When I added my code and tried to test it, I hit a series of issues across multiple files and fixed them one by one.

Issues seen and fixes added:

1) catalogd failed to start with:
RuntimeError: Expected 3 impalad(s), only 0 found catalogd failed to start.
When I checked the log file, I saw the error:
Could not find symbol '_ZN6impala15VectorFunctions17EuclideanDistanceEPN10impala_udf15FunctionContextERKNS1_13CollectionValES6_'
Other ExprsIr symbols (e.g., MathFunctions) were present, but the VectorFunctions symbols were missing. Added the include and the calls to InitBuiltinsDummy() in scalar-expr-evaluator.cc.

2) After the above change, I rebuilt (cmake) and was able to start the Impala cluster, including catalogd. I also verified that the euclidean function was registered by manually entering the impala shell, but saw an issue while executing the select query:
[localhost:21050] default> SELECT euclidean_distance(vec1, vec2) FROM test_vectors_simple;
Query: SELECT euclidean_distance(vec1, vec2) FROM test_vectors_simple
Query submitted at: 2025-12-13 14:46:01 (Coordinator: http://bf70f323bd8b:25000)
2025-12-13 14:46:02 [Exception] ERROR: Query 044277bc307555a0:e19b0a4400000000 failed: AnalysisException: No matching function with signature: euclidean_distance(ARRAY<FLOAT>, ARRAY<FLOAT>).

ArrayType didn't override matchesType(), so function matching always failed for array parameters. Added a matchesType() override to ArrayType.

3) Querying the Parquet table then failed with a missing-symbol error:

[localhost:21050] default> SELECT euclidean_distance(vec1, vec2) FROM test_vectors_parquet;
Query: SELECT euclidean_distance(vec1, vec2) FROM test_vectors_parquet
Query submitted at: 2025-12-13 15:20:59 (Coordinator: http://bf70f323bd8b:25000)
Query state can be monitored at: http://bf70f323bd8b:25000/query_plan?query_id=684b9f46758684ab:551cc03a00000000
2025-12-13 15:21:01 [Exception] ERROR: Query 684b9f46758684ab:551cc03a00000000 failed: Builtin 'euclidean_distance' with symbol '_ZN6impala15VectorFunctions17EuclideanDistanceEPN10impala_udf15FunctionContextERKNS1_13CollectionValES6_' does not exist. Verify that all your impalads are the same version. Fragment failed during codegen, fragment index: F01

The symbol wasn't found during codegen, so I added the vector-functions-ir.cc include to impala-ir.cc.

4) Connection reset error: the session crashed and was disconnected while running the query below:

[localhost:21050] default> SELECT euclidean_distance(vec1, vec2) FROM test_vectors_parquet;
Query: SELECT euclidean_distance(vec1, vec2) FROM test_vectors_parquet
Query submitted at: 2025-12-13 14:58:13 (Coordinator: http://bf70f323bd8b:25000)
Query state can be monitored at: http://bf70f323bd8b:25000/query_plan?query_id=4b44d5a3833d8afc:8e00b52c00000000
2025-12-13 14:58:19 [Exception] type=<class 'ConnectionResetError'> in GetOperationStatus.
[Errno 104] Connection reset by peer
2025-12-13 14:58:19 [Exception] Socket error104 [Errno 104] Connection reset by peer

Saw the error below for anyval-util.h and anyval-util.cc in the Impala logs:

Store to codegen cache succeeded. CodeGen Cache Key hash_code=f77f2b8197a7acfb:f67765bed3b3095b
F20251213 15:26:43.001370 1265839 anyval-util.h:207] 444942f08596faa1:03df497500000000] Check failed: false ARRAY

Added changes to anyval-util.h, anyval-util.cc and udf.h to fix the "Unknown type: ARRAY" error.

5) Faced a few challenges while adding E2E test cases: the table format, "array" being a reserved keyword, and parallel race conditions caused by test instances overwriting the same table, with one instance dropping the table while another was querying it. I created and populated the table through Hive and then ran the queries through the impala shell. I tested one case manually, then ran my E2E test cases from the vector-distance-functions.test file with the command below and verified in the output that the tests passed:

./tests/run-tests.py query_test/test_exprs.py::TestExprs::test_vector_distance_functions -v -s

Example of the table format error:

Query: SELECT euclidean_distance(vec1, vec2) FROM test_vectors_simple
Query submitted at: 2025-12-13 14:54:17 (Coordinator: http://bf70f323bd8b:25000)
2025-12-13 14:54:19 [Exception] ERROR: Query e84abec2f975ac5d:bd1e87a400000000 failed: NotImplementedException: Scan of table 'default.test_vectors_simple' in format 'TEXT' is not supported because the table has a column 'vec1' with a complex type 'ARRAY<FLOAT>'. Complex types are supported for these file formats: PARQUET, ORC, HUDI_PARQUET, PAIMON.

I see that the Impala Jenkins job is still failing, but I am unable to access the URL. I saw on Slack that this is a recent change, and I am checking with Laszlo to see whether I can get access to check and debug the cause of the Jenkins failure.
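
For reference, the two code-level changes listed at the top (reading a float slot by dereference instead of memcpy, and computing the difference in double before squaring) can be illustrated with a small standalone sketch. This is not the code from the patch: the function names ReadFloatMemcpy, ReadFloatDeref and EuclideanDistanceSketch are hypothetical, the inputs are plain float buffers rather than Impala's CollectionVal, and the dereference variant assumes the slot is suitably aligned for float.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a float from a byte buffer via memcpy.
// Works for any alignment of tuple_ptr + slot_offset.
float ReadFloatMemcpy(const std::uint8_t* tuple_ptr, std::size_t slot_offset) {
  float v;
  std::memcpy(&v, tuple_ptr + slot_offset, sizeof(v));
  return v;
}

// Hypothetical helper: read the same float by casting and dereferencing.
// Assumption: the slot really holds a float and is aligned for float.
float ReadFloatDeref(const std::uint8_t* tuple_ptr, std::size_t slot_offset) {
  return *reinterpret_cast<const float*>(tuple_ptr + slot_offset);
}

// Sketch of a Euclidean distance over two float arrays of length n.
// The difference is widened to double before squaring, so values near
// the float range limit do not overflow to infinity.
double EuclideanDistanceSketch(const float* a, const float* b, std::size_t n) {
  assert(a != nullptr && b != nullptr);
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    const double diff = static_cast<double>(a[i]) - static_cast<double>(b[i]);
    sum += diff * diff;
  }
  return std::sqrt(sum);
}
```

With float arithmetic, the difference between 3e38f and -3e38f already exceeds the float range, so the squared term would be infinite; widening to double first keeps the result finite.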
--
To view, visit http://gerrit.cloudera.org:8080/23730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id305acc87530d5d0e53613fe8df9a631ea4e1080
Gerrit-Change-Number: 23730
Gerrit-PatchSet: 9
Gerrit-Owner: Raghav Jindal <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Pranav Lodha <[email protected]>
Gerrit-Reviewer: Raghav Jindal <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Mon, 15 Dec 2025 19:14:33 +0000
Gerrit-HasComments: No
