JingsongLi commented on code in PR #8108:
URL: https://github.com/apache/paimon/pull/8108#discussion_r3348867738
##########
paimon-spark/paimon-spark-ut/src/test/java/org/apache/paimon/spark/SparkMultimodalITCase.java:
##########
@@ -128,12 +130,25 @@ public void testVector(@TempDir java.nio.file.Path tempDir) throws IOException {
"select gid, sid, embs from vector_search('my_db1.vector_test', 'embs', array(1.0f, 2.0f, 3.0f, 4.0f), 5) where date = '20260420'")
.collectAsList();
assertThat(rows).hasSize(5);
- Dataset<Row> df =
- spark.sql(
- "select gid, sid, embs, __paimon_vector_search_score from vector_search('my_db1.vector_test', 'embs', array(1.0f, 2.0f, 3.0f, 4.0f), 5) where date = '20260420'");
+ String vectorSearchSql =
+ "select gid, sid, embs, __paimon_vector_search_score "
+ + "from vector_search('my_db1.vector_test', 'embs', array(1.0f, 2.0f, 3.0f, 4.0f), 5) "
+ + "where date = '20260420'";
+ Dataset<Row> df = spark.sql(vectorSearchSql);
assertThat(df.columns()).hasSize(4);
rows = df.collectAsList();
assertThat(rows).hasSize(5);
+ spark.sql("set spark.paimon.vector-search.distribute.enabled = true;");
Review Comment:
This assertion does not seem to exercise the new Spark-distributed path. The
table has only a small number of vector splits, and `SparkVectorReadImpl`
falls back to `super.read` unless
`splits.size() >= global-index.thread-num * 2` (64 splits by default). As a
result, the serialization and distributed-execution code could be broken and
this test would still pass. Could we force the distributed branch in this
test, for example by setting `spark.paimon.global-index.thread-num=1` or by
creating enough index shards/splits?
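
To make the concern concrete, here is a minimal sketch of the gating rule as
described in this comment. The class and method names are hypothetical and are
not taken from `SparkVectorReadImpl`; the thread count of 32 is only inferred
from the 64-split default threshold, and the split count of 4 is an assumed
stand-in for a small test table.

```java
// Hypothetical sketch: names are illustrative, not the real SparkVectorReadImpl API.
final class DistributedBranchSketch {

    // Rule quoted in this review: take the distributed path only when
    // splits.size() >= global-index.thread-num * 2; otherwise fall back.
    static boolean usesDistributedPath(int splitCount, int threadNum) {
        return splitCount >= threadNum * 2;
    }

    public static void main(String[] args) {
        int smallTableSplits = 4; // assumed split count for a tiny test table

        // Assumed thread-num of 32 gives a threshold of 64, so the test falls back.
        System.out.println(usesDistributedPath(smallTableSplits, 32)); // false

        // With thread-num = 1 the threshold drops to 2, so the same table qualifies.
        System.out.println(usesDistributedPath(smallTableSplits, 1)); // true
    }
}
```

If the rule holds as written, lowering the thread count in the test is the
cheaper of the two options, since it avoids having to build a table with
dozens of index splits.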