Re: [PR] [spark] support distributed execution of vector search on spark [paimon]

via GitHub Wed, 03 Jun 2026 06:32:39 -0700


JingsongLi commented on code in PR #8108:
URL: https://github.com/apache/paimon/pull/8108#discussion_r3348867738



##########
paimon-spark/paimon-spark-ut/src/test/java/org/apache/paimon/spark/SparkMultimodalITCase.java:
##########
@@ -128,12 +130,25 @@ public void testVector(@TempDir java.nio.file.Path tempDir) throws IOException {
                                 "select gid, sid,  embs from vector_search('my_db1.vector_test', 'embs', array(1.0f, 2.0f, 3.0f, 4.0f), 5)  where date = '20260420'")
                         .collectAsList();
         assertThat(rows).hasSize(5);
-        Dataset<Row> df =
-                spark.sql(
-                        "select gid, sid,  embs, __paimon_vector_search_score from vector_search('my_db1.vector_test', 'embs', array(1.0f, 2.0f, 3.0f, 4.0f), 5)  where date = '20260420'");
+        String vectorSearchSql =
+                "select gid, sid,  embs, __paimon_vector_search_score "
+                        + "from vector_search('my_db1.vector_test', 'embs', array(1.0f, 2.0f, 3.0f, 4.0f), 5) "
+                        + "where date = '20260420'";
+        Dataset<Row> df = spark.sql(vectorSearchSql);
         assertThat(df.columns()).hasSize(4);
         rows = df.collectAsList();
         assertThat(rows).hasSize(5);
+        spark.sql("set spark.paimon.vector-search.distribute.enabled = true;");

Review Comment:
   This does not seem to exercise the new Spark-distributed path: the table 
only has a small number of vector splits, while `SparkVectorReadImpl` falls 
back to `super.read` unless `splits.size() >= global-index.thread-num * 2` 
(64 splits by default). As a result, the serialization/distributed execution 
code could be broken and this test would still pass. Could we force the 
distributed branch in this test, for example by setting 
`spark.paimon.global-index.thread-num=1` or by creating enough index 
shards/splits?
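
   To make the arithmetic concrete, here is a minimal sketch of the fallback
condition quoted above. It is not the actual `SparkVectorReadImpl` code: the
class and method names are hypothetical, the default thread-num of 32 is only
inferred from "64 splits by default", and the split count of 5 is a made-up
stand-in for "a small number of vector splits".

```java
// Sketch only: mirrors the condition `splits.size() >= global-index.thread-num * 2`
// from the comment above. Names here are hypothetical, not Paimon APIs.
final class DistributedThresholdSketch {

    // Assumption: inferred from "64 splits by default" = thread-num * 2.
    static final int ASSUMED_DEFAULT_THREAD_NUM = 32;

    // Minimum number of splits needed before the distributed branch is taken.
    static int minSplitsForDistributed(int threadNum) {
        return threadNum * 2;
    }

    // True when the read would go through the distributed branch,
    // false when it would fall back to the local read.
    static boolean takesDistributedPath(int splitCount, int threadNum) {
        return splitCount >= minSplitsForDistributed(threadNum);
    }

    public static void main(String[] args) {
        int splitCount = 5; // hypothetical split count for a small test table

        // With the assumed default, 5 < 64, so the read falls back.
        System.out.println(takesDistributedPath(splitCount, ASSUMED_DEFAULT_THREAD_NUM));

        // With thread-num = 1, the threshold drops to 2, so 5 splits qualify.
        System.out.println(takesDistributedPath(splitCount, 1));
    }
}
```

   Under these assumptions, lowering thread-num to 1 means any table with at
least 2 splits would take the distributed branch, which is why that setting
looks like the cheaper of the two options for this test.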



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
