Re: [DISCUSS] FLIP-540: Support VECTOR_SEARCH in Flink SQL

Shengkai Fang Fri, 15 Aug 2025 03:00:09 -0700

Hi, Timo. Thanks for your explanation.

2) RowTime


I agree that we should maintain conceptual consistency as much as possible
to reduce the learning curve for users. I’ve updated the FLIP to include
the rowtime column in the output when the on_time attribute is specified.
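
To make the resulting schema concrete, here is a minimal sketch in plain Java of how the output column list could be assembled. It is not Flink planner code: the class and method names are hypothetical, and the rule of resolving a name clash with a numeric suffix (which gives `ts0` in my earlier example) is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Flink.
public class VectorSearchSchemaSketch {

    // Adds a column name; on a clash, appends 0, 1, 2, ... until it is unique
    // (assumed de-duplication rule).
    static void addUnique(List<String> out, Set<String> seen, String name) {
        String candidate = name;
        int suffix = 0;
        while (!seen.add(candidate)) {
            candidate = name + suffix++;
        }
        out.add(candidate);
    }

    // Output = input columns + search table columns + score
    //          (+ a rowtime column only if on_time was specified).
    public static List<String> outputColumns(
            List<String> inputColumns,
            List<String> searchTableColumns,
            String onTimeColumn) {
        List<String> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String column : inputColumns) {
            addUnique(out, seen, column);
        }
        for (String column : searchTableColumns) {
            addUnique(out, seen, column);
        }
        addUnique(out, seen, "score");
        if (onTimeColumn != null) {
            // Assumption: the appended time column reuses the on_time column name.
            addUnique(out, seen, onTimeColumn);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("query_col", "ts");
        List<String> searchTable = List.of("id", "search_col");
        System.out.println(outputColumns(input, searchTable, "ts"));
        System.out.println(outputColumns(input, searchTable, null));
    }
}
```

With the schemas from my earlier example, the first call yields `[query_col, ts, id, search_col, score, ts0]`; without on_time, the trailing time column is omitted.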

Best,
Shengkai

On Fri, Aug 15, 2025 at 16:23, Timo Walther <twal...@apache.org> wrote:

> Hi Shengkai,
>
> thanks for the quick response.
>
> 2) RowTime
>
>  > This results in two timestamp fields. Having both may be confusing.
>  > Is this the intended behavior?
>
> I agree that this is not ideal. Our SESSION window PTFs have a similar
> problem: they also return two timestamp columns when you run them without
> aggregation:
>
> SELECT *
> FROM SESSION(
>    DATA => TABLE t PARTITION BY name,
>    TIMECOL => DESCRIPTOR(ts),
>    GAP => INTERVAL '5' MINUTES)
>
> And both qualify as a time attribute.
>
> My goal is to reach consistency between different PTFs so that
> users only need to learn the concept and syntax once. But we could say that
> correlated PTFs work slightly differently from user-defined PTFs, since
> correlated PTFs always have columns coming from the left side.
>
> For ML_PREDICT we also have a similar challenge that `on_time` and `uid`
> should not be available. So another mismatch to user-defined PTFs.
>
> Thinking about the future, I guess we could add more
> StaticArgumentTraits for the table argument during declaration if we
> ever want to expose correlated PTFs to users. Something like...
>
> public static class UserDefinedCorrelatedPtf
>     extends ProcessTableFunction<Row> {
>
>   // CORRELATED_TABLE is the proposed future trait, not an existing one.
>   public void eval(
>       @ArgumentHint({CORRELATED_TABLE}) Row r,
>       Integer i) {
>     ...
>   }
> }
>
> It would disable emitting the rowtime column in output and disallow
> setting timers.
>
> 3) Naming
>
> No strong opinion. We can also go with SEARCH_VECTOR.
>
> Cheers,
> Timo
>
>
> On 15.08.25 07:12, Shengkai Fang wrote:
> > Hi Timo, thank you for your detailed suggestions. Please see my responses
> > below.
> >
> > 1) ProcTime
> >
> > +1 for aligning the behavior with PTF. I’ve updated the FLIP accordingly.
> >
> > 2) RowTime
> >
> > I have some concerns regarding the `ROWTIME` handling. Let me illustrate
> > with an example.
> >
> > Suppose the input table schema is:
> > `<query_col ARRAY<FLOAT>, ts TIMESTAMP(3) *ROWTIME*>`
> > and the vector table schema is:
> > `<id INT, search_col ARRAY<FLOAT>>`
> >
> > Using the following SQL:
> > ```sql
> > SELECT * FROM input_table, LATERAL TABLE(VECTOR_SEARCH(
> >     SEARCH_TABLE => TABLE vector_table,
> >     COLUMN_TO_SEARCH => DESCRIPTOR(search_col),
> >     COLUMN_TO_QUERY => input_table.query_col,
> >     ON_TIME => input_table.ts))
> > ```
> >
> > The output schema becomes:
> > ROW<query_col ARRAY<FLOAT>, ts TIMESTAMP(3), id INT, search_col
> > ARRAY<FLOAT>, score DOUBLE, ts0 TIMESTAMP(3)>
> >
> > This results in two timestamp fields: ts (from input) and ts0 (generated by
> > the operator). Having both may be confusing. Is this the intended behavior?
> >
> > 3) Naming
> >
> > I did consider SEARCH_VECTOR, but many vendors use VECTOR_SEARCH — for
> > example, Spark[1] and BigQuery[2].
> > To maintain consistency and reduce the learning curve, I suggest aligning
> > with existing industry practice.
> >
> >
> > [1] https://docs.databricks.com/aws/en/sql/language-manual/functions/vector_search
> > [2] https://cloud.google.com/bigquery/docs/vector-search-intro
> >
> > Best,
> > Shengkai
> >
> > On Thu, Aug 14, 2025 at 21:49, Timo Walther <twal...@apache.org> wrote:
> >
> >> Hi Shengkai,
> >>
> >> thank you for proposing this FLIP. Also, thank you for considering my
> >> thoughts from FLIP-517, even though I haven't managed to finalize the
> >> discussion/voting yet.
> >>
> >> It looks mostly good to me. However, I would like to discuss the
> >> semantics of the `on_time` parameter:
> >>
> >> 1) Proctime
> >>
> >> I truly believe we should avoid the need for a `proctime` attribute.
> >> Teaching the rowtime attributes to users is already painful enough, but
> >> additionally teaching proctime is worse. For PTFs of FLIP-440, only
> >> rowtime attributes can be used in f(on_time => ...) and we should do the
> >> same for future built-in PTFs. Not specifying `on_time` can be equal to
> >> proctime.
> >>
> >> So users can just naturally use the PTF, with the mental model of
> >> LATERAL being a foreach loop where each invocation happens instantly (in
> >> processing time).
> >>
> >> 2) Rowtime
> >>
> >> All PTFs should follow the SystemTypeInference:
> >>
> >> https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/types/inference/SystemTypeInference.java#L239
> >>
> >> It assumes that when an `on_time` parameter is passed, the result
> >> appends a `rowtime` column that can be used in subsequent time-based
> >> operations. Can we add such a column in the output for VECTOR_SEARCH as
> >> well?
> >>
> >> 3) Naming
> >>
> >> Just a general note, feel free to ignore: a function or operation should
> >> use a verb, not a noun, e.g. JOIN, SEARCH, SELECT. Vector search is a
> >> concept. The function should rather be called `SEARCH_VECTOR`. This was
> >> also explained in FLIP-517.
> >>
> >> Thanks,
> >> Timo
> >>
> >>
> >> On 14.08.25 03:31, Shengkai Fang wrote:
> >>> Hi, all.
> >>>
> >>> There has been no feedback for a while. I plan to close this FLIP
> >>> tomorrow unless there are further comments. Thank you all for the
> >>> discussion.
> >>>
> >>> Best,
> >>> Shengkai
> >>>
> >>> On Thu, Jul 31, 2025 at 15:47, Yash Anand <yashanand.0...@gmail.com> wrote:
> >>>
> >>>> Hi Shengkai,
> >>>>
> >>>> Thanks for the FLIP. This will be a great addition to Flink's AI
> >>>> capabilities. +1 for this feature.
> >>>>
> >>>> Best,
> >>>> Yash Anand
> >>>>
> >>>> On Tue, Jul 29, 2025 at 7:23 PM Jacky Lau <liuyong...@gmail.com> wrote:
> >>>>
> >>>>> Hi Shengkai,
> >>>>>
> >>>>> Thanks for the FLIP and enhancement for AI capabilities in Flink. +1
> >>>>> for this feature.
> >>>>>
> >>>>> Best,
> >>>>> Jacky Lau
> >>>>>
> >>>>> On Wed, Jul 30, 2025 at 01:03, Hao Li <h...@confluent.io.invalid> wrote:
> >>>>>
> >>>>>> Hi Shengkai,
> >>>>>>
> >>>>>> Thanks for the FLIP and enhancement for AI capabilities in Flink. +1.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Hao
> >>>>>>
> >>>>>> On Tue, Jul 29, 2025 at 2:16 AM Shengkai Fang <fskm...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>> I'd like to start a discussion of FLIP-540: Support VECTOR_SEARCH in
> >>>>>>> Flink SQL[1].
> >>>>>>>
> >>>>>>> In FLIP-437/FLIP-525, Apache Flink has initially integrated Large
> >>>>>>> Language Model (LLM) capabilities, enabling semantic understanding and
> >>>>>>> real-time processing of streaming data pipelines. This integration has
> >>>>>>> been technically validated in scenarios such as log classification and
> >>>>>>> real-time question-answering systems. However, the current architecture
> >>>>>>> allows Flink to only use embedding models to convert unstructured data
> >>>>>>> (e.g., text, images) into high-dimensional vector features, which are
> >>>>>>> then persisted to downstream storage systems (e.g., Milvus, MongoDB). It
> >>>>>>> lacks real-time online querying and similarity analysis capabilities for
> >>>>>>> vector spaces. To address this limitation, we propose introducing the
> >>>>>>> VECTOR_SEARCH function in this FLIP, enabling users to perform streaming
> >>>>>>> vector similarity searches and real-time context retrieval (e.g.,
> >>>>>>> Retrieval-Augmented Generation, RAG) directly within Flink.
> >>>>>>>
> >>>>>>> Looking forward to comments and suggestions for improvements!
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Shengkai
> >>>>>>>
> >>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-540%3A+Support+VECTOR_SEARCH+in+Flink+SQL
