Re: [DISCUSS] FLIP-540: Support VECTOR_SEARCH in Flink SQL

Timo Walther Mon, 18 Aug 2025 00:08:25 -0700

Hi Shengkai,

thanks for updating the FLIP. The FLIP looks good to me. Again, if you or others think the rowtime column in the output is just duplicate data, I'm also fine with dropping the column for all correlated PTFs.


In one of the next Calcite versions, we should remove the TABLE() function:

LATERAL TABLE(VECTOR_SEARCH()) -> LATERAL VECTOR_SEARCH()

This could read nicer and has been done for non-correlated PTFs already.

Thanks,
Timo


On 15.08.25 11:59, Shengkai Fang wrote:

Hi, Timo. Thanks for your explanation.

2) RowTime

I agree that we should maintain conceptual consistency as much as possible
to reduce the learning curve for users. I’ve updated the FLIP to include
the rowtime column in the output when the on_time attribute is specified.

Best,
Shengkai

On Fri, Aug 15, 2025 at 16:23 Timo Walther <[email protected]> wrote:

Hi Shengkai,

thanks for the quick response.

2) RowTime

  > This results in two timestamp fields. Having both may be confusing.
  > Is this the intended behavior?

I agree that this is not ideal. Our SESSION window PTFs have a similar
problem: they also return two timestamp columns when you run them without
aggregation:

SELECT *
FROM SESSION(
    DATA => TABLE t PARTITION BY name,
    TIMECOL => DESCRIPTOR(ts),
    GAP => INTERVAL '5' MINUTES)

And both qualify as a time attribute.

My goal is to reach consistency between different PTFs such that
users only need to learn the concept and syntax once. But we could say that
correlated PTFs work slightly differently from user-defined PTFs, since
correlated PTFs always have columns coming from the left side.

For ML_PREDICT we also have a similar challenge that `on_time` and `uid`
should not be available. So another mismatch to user-defined PTFs.

Thinking about the future, I guess we could add more
StaticArgumentTraits for the table argument during declaration if we
ever want to expose correlated PTFs to users. Something like...

public static class UserDefinedCorrelatedPtf
    extends ProcessTableFunction<Row> {

    public void eval(
        @ArgumentHint({CORRELATED_TABLE}) Row r,
        Integer i) {
      ...
    }
}

It would disable emitting the rowtime column in output and disallow
setting timers.

3) Naming

No strong opinion. We can also go with SEARCH_VECTOR.

Cheers,
Timo

On 15.08.25 07:12, Shengkai Fang wrote:

Hi Timo, thank you for your detailed suggestions. Please see my responses
below.

1) ProcTime

+1 for aligning the behavior with PTF. I’ve updated the FLIP accordingly.

2) RowTime

I have some concerns regarding the `ROWTIME` handling. Let me illustrate
with an example.

Suppose the input table schema is:
`<query_col ARRAY<FLOAT>, ts TIMESTAMP(3) *ROWTIME*>`
and the vector table schema is:
`<id INT, search_col ARRAY<FLOAT>>`

Using the following SQL:
```sql
SELECT * FROM input_table, LATERAL TABLE(VECTOR_SEARCH(
     SEARCH_TABLE => TABLE vector_table,
     COLUMN_TO_SEARCH => DESCRIPTOR(search_col),
     COLUMN_TO_QUERY => input_table.query_col,
     ON_TIME => input_table.ts))
```

The output schema becomes:
ROW<query_col ARRAY<FLOAT>, ts TIMESTAMP(3), id INT, search_col ARRAY<FLOAT>, score DOUBLE, ts0 TIMESTAMP(3)>

This results in two timestamp fields: ts (from input) and ts0 (generated
by the operator).
Having both may be confusing. Is this the intended behavior?
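
To make the duplication concrete, here is a minimal sketch in plain Java (not Flink code) of how such an output schema could be assembled: left-side columns, then search-table columns, then `score`, then a time column only when `on_time` is given. The class and method names are hypothetical, and the rule of deriving the extra column's name from the `on_time` column plus a numeric suffix is an assumption that simply mirrors the `ts0` in the example above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain Java, not Flink API. All names are hypothetical.
final class OutputSchemaSketch {

    // Adds "name type"; if the name is taken, tries name0, name1, ...
    // (assumed de-duplication rule, chosen to reproduce "ts0").
    static void addUnique(List<String> names, List<String> columns, String name, String type) {
        String candidate = name;
        for (int i = 0; names.contains(candidate); i++) {
            candidate = name + i;
        }
        names.add(candidate);
        columns.add(candidate + " " + type);
    }

    // Each column is a {name, type} pair; onTime may be null when ON_TIME is omitted.
    static List<String> outputSchema(String[][] input, String[][] searchTable, String onTime) {
        List<String> names = new ArrayList<>();
        List<String> columns = new ArrayList<>();
        String onTimeType = null;
        for (String[] c : input) {
            addUnique(names, columns, c[0], c[1]);
            if (c[0].equals(onTime)) {
                onTimeType = c[1];
            }
        }
        for (String[] c : searchTable) {
            addUnique(names, columns, c[0], c[1]);
        }
        addUnique(names, columns, "score", "DOUBLE");
        // The extra time column is appended only if on_time names an input column.
        if (onTimeType != null) {
            addUnique(names, columns, onTime, onTimeType);
        }
        return columns;
    }

    public static void main(String[] args) {
        String[][] input = {{"query_col", "ARRAY<FLOAT>"}, {"ts", "TIMESTAMP(3)"}};
        String[][] vectorTable = {{"id", "INT"}, {"search_col", "ARRAY<FLOAT>"}};
        System.out.println(outputSchema(input, vectorTable, "ts"));
        System.out.println(outputSchema(input, vectorTable, null));
    }
}
```

With `on_time` set to `ts`, the sketch yields six columns ending in a second timestamp; without it, the schema stops at `score`.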

3) Naming

I did consider SEARCH_VECTOR, but many vendors use VECTOR_SEARCH — for
example, Spark[1] and BigQuery[2].
To maintain consistency and reduce the learning curve, I suggest aligning
with existing industry practice.


[1] https://docs.databricks.com/aws/en/sql/language-manual/functions/vector_search
[2] https://cloud.google.com/bigquery/docs/vector-search-intro

Best,
Shengkai

On Thu, Aug 14, 2025 at 21:49 Timo Walther <[email protected]> wrote:

Hi Shengkai,

thank you for proposing this FLIP. Also, thank you for considering my
thoughts from FLIP-517, even though I haven't managed to finalize the
discussion/voting yet.

It looks mostly good to me. However, I would like to discuss the
semantics of the `on_time` parameter:

1) Proctime

I truly believe we should avoid the need for a `proctime` attribute.
Teaching the rowtime attributes to users is already painful enough, but
additionally teaching proctime is worse. For PTFs of FLIP-440, only
rowtime attributes can be used in f(on_time => ...) and we should do the
same for future built-in PTFs. Not specifying `on_time` can be treated as
equivalent to proctime.

So users can just naturally use the PTF, with the mental model of
LATERAL being a foreach loop where each invocation happens instantly (in
processing time).
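
One way to picture this mental model, as a minimal sketch in plain Java rather than Flink code: the correlated call behaves like a loop over the left-side rows, and each iteration searches the current contents of the search table right away. The class name, the row types, the cosine-similarity scoring, and the `topK` parameter are all assumptions made for illustration; none of them comes from the FLIP.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: plain Java, not Flink API. All names are hypothetical.
final class LateralForeachSketch {

    // One row of the search table: <id INT, search_col ARRAY<FLOAT>>.
    static final class VectorRow {
        final int id;
        final float[] searchCol;

        VectorRow(int id, float[] searchCol) {
            this.id = id;
            this.searchCol = searchCol;
        }
    }

    // One emitted row: index of the left row, matched id, similarity score.
    static final class Match {
        final int leftIndex;
        final int id;
        final double score;

        Match(int leftIndex, int id, double score) {
            this.leftIndex = leftIndex;
            this.id = id;
            this.score = score;
        }
    }

    // Cosine similarity (assumed scoring); expects non-zero vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // The "LATERAL" part: a foreach loop over the left rows. Each iteration
    // runs the search immediately and needs no event-time input.
    static List<Match> lateralSearch(List<float[]> leftRows, List<VectorRow> table, int topK) {
        List<Match> out = new ArrayList<>();
        for (int i = 0; i < leftRows.size(); i++) {
            List<Match> candidates = new ArrayList<>();
            for (VectorRow v : table) {
                candidates.add(new Match(i, v.id, cosine(leftRows.get(i), v.searchCol)));
            }
            // Highest score first, then keep at most topK matches for this left row.
            candidates.sort(Comparator.comparingDouble((Match m) -> m.score).reversed());
            out.addAll(candidates.subList(0, Math.min(topK, candidates.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<float[]> left = new ArrayList<>();
        left.add(new float[] {1f, 0f});
        left.add(new float[] {0f, 1f});
        List<VectorRow> table = new ArrayList<>();
        table.add(new VectorRow(1, new float[] {1f, 0f}));
        table.add(new VectorRow(2, new float[] {0f, 1f}));
        table.add(new VectorRow(3, new float[] {1f, 1f}));
        for (Match m : lateralSearch(left, table, 2)) {
            System.out.println("left=" + m.leftIndex + " id=" + m.id + " score=" + m.score);
        }
    }
}
```

Nothing in the loop takes a time attribute as input, which is the point of the proposal: omitting `on_time` leaves only the implicit "now" of each invocation.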

2) Rowtime

All PTFs should follow the SystemTypeInference:

https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/types/inference/SystemTypeInference.java#L239


It assumes that when an `on_time` parameter is passed, the result
appends a `rowtime` column that can be used in subsequent time-based
operations. Can we add such a column in the output for VECTOR_SEARCH as
well?

3) Naming

Just a general note, feel free to ignore: A function or operation should
use a verb not a noun. E.g. JOIN, SEARCH, SELECT. Vector search is a
concept. The function should rather be called `SEARCH_VECTOR`. This was
also explained in FLIP-517.

Thanks,
Timo


On 14.08.25 03:31, Shengkai Fang wrote:

Hi, all.

There has been no feedback for a while. I plan to close this FLIP tomorrow
unless there are further comments. Thank you all for the discussion.

Best,
Shengkai

On Thu, Jul 31, 2025 at 15:47 Yash Anand <[email protected]> wrote:

Hi Shengkai,

Thanks for the FLIP, this will be a great addition to Flink's AI
capabilities. +1 for this feature.

Best,
Yash Anand

On Tue, Jul 29, 2025 at 7:23 PM Jacky Lau <[email protected]> wrote:

Hi Shengkai,

Thanks for the FLIP and enhancement for AI capabilities in Flink. +1 for
this feature.

Best,
Jacky Lau

On Wed, Jul 30, 2025 at 01:03 Hao Li <[email protected]> wrote:

Hi Shengkai,

Thanks for the FLIP and enhancement for AI capabilities in Flink.

+1.


Thanks,
Hao

On Tue, Jul 29, 2025 at 2:16 AM Shengkai Fang <[email protected]> wrote:

Hi,
I'd like to start a discussion of FLIP-540: Support VECTOR_SEARCH in Flink
SQL[1].

With FLIP-437/FLIP-525, Apache Flink took a first step toward integrating
Large Language Model (LLM) capabilities, enabling semantic understanding
and real-time processing in streaming data pipelines. This integration has
been technically validated in scenarios such as log classification and
real-time question-answering systems. However, the current architecture
only allows Flink to use embedding models to convert unstructured data
(e.g., text, images) into high-dimensional vector features, which are then
persisted to downstream storage systems (e.g., Milvus, MongoDB). It lacks
real-time online querying and similarity analysis capabilities for vector
spaces. To address this limitation, we propose introducing the
VECTOR_SEARCH function in this FLIP, enabling users to perform streaming
vector similarity searches and real-time context retrieval (e.g.,
Retrieval-Augmented Generation, RAG) directly within Flink.

Looking forward to comments and suggestions for improvements!

Best,
Shengkai

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-540%3A+Support+VECTOR_SEARCH+in+Flink+SQL
