[DISCUSS] Model routing for ML_PREDICT (FLINK-39961)

Purshotam Shah via dev Mon, 29 Jun 2026 17:47:52 -0700

Hi all,

FLIP-525/526 bind each ML_PREDICT query to a single model. I'd like to
discuss supporting model routing — choosing among several candidate models
per row, e.g. cheap vs. strong models, or fallback on failure — and get
feedback on the approach.


JIRA: https://issues.apache.org/jira/browse/FLINK-39961

Similar routing/fallback patterns are already common around SQL-based ML/AI
products such as BigQuery ML, Databricks, and Snowflake Cortex. The goal
here is to support this SQL-native pattern inside Flink's model functions,
rather than requiring separate jobs, an external gateway, or UDF glue.

Approach: implement routing as a ModelProvider via the existing SPI, i.e.
'provider' = 'routing'. It wraps several candidate models, picks one per
row with a pluggable strategy, and delegates to that candidate's existing
predict runtime. It reuses the ML_PREDICT lookup-join path, so there are no
planner or runtime changes.
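As a rough illustration of the wrap, select, delegate shape described above, here is a minimal Java sketch. Every name in it (RoutingSketch, RoutingStrategy, Predictor) is hypothetical; none of it is taken from Flink's ModelProvider SPI or from the prototype, and rows are modeled as plain maps purely to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical names throughout; these are NOT Flink SPI types.
final class RoutingSketch {

    /** Pluggable strategy: returns the name of the candidate to use for one row. */
    interface RoutingStrategy {
        String select(Map<String, Object> row);
    }

    /** Stand-in for a candidate's existing predict runtime. */
    interface Predictor extends Function<Map<String, Object>, String> {}

    private final Map<String, Predictor> candidates;
    private final RoutingStrategy strategy;

    RoutingSketch(Map<String, Predictor> candidates, RoutingStrategy strategy) {
        this.candidates = new LinkedHashMap<>(candidates);
        this.strategy = strategy;
    }

    /** Picks a candidate for the row and delegates the prediction to it. */
    String predict(Map<String, Object> row) {
        String name = strategy.select(row);
        Predictor chosen = candidates.get(name);
        if (chosen == null) {
            throw new IllegalStateException("Strategy selected unknown candidate: " + name);
        }
        return chosen.apply(row);
    }

    /** Example wiring: queries longer than 200 chars go to "smart", the rest to "cheap". */
    static RoutingSketch example() {
        Map<String, Predictor> models = new LinkedHashMap<>();
        models.put("cheap", row -> "cheap:" + row.get("query"));
        models.put("smart", row -> "smart:" + row.get("query"));
        RoutingStrategy rule =
                row -> ((String) row.get("query")).length() > 200 ? "smart" : "cheap";
        return new RoutingSketch(models, rule);
    }
}
```

Calling RoutingSketch.example().predict(row) with a short query returns the "cheap" candidate's output; the strategy never calls a model itself, which is what would let rule-based, judge, and classifier selectors share one delegation path.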

Built-in strategies would include rule-based routing over input columns, an
LLM judge selector, and a classifier/scoring-model selector, plus an
ordered fallback chain.
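One possible reading of "ordered fallback chain" is: try candidates in their configured order and return the first successful result. The Java sketch below assumes exactly that, and also assumes that any RuntimeException counts as a failure; the class and method names are hypothetical, and the prototype may define failure differently (timeouts, error codes, retries).

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper; not part of Flink or the prototype.
final class FallbackSketch {

    /**
     * Tries each candidate in order and returns the first result that does not throw.
     * If every candidate fails, throws IllegalStateException carrying each failure
     * as a suppressed exception, in the order the candidates were tried.
     */
    static <I, O> O predictWithFallback(List<Function<I, O>> orderedCandidates, I row) {
        if (orderedCandidates.isEmpty()) {
            throw new IllegalArgumentException("At least one candidate is required");
        }
        IllegalStateException allFailed =
                new IllegalStateException("All " + orderedCandidates.size() + " candidates failed");
        for (Function<I, O> candidate : orderedCandidates) {
            try {
                return candidate.apply(row);
            } catch (RuntimeException e) {
                // Assumption: any runtime failure triggers fallback to the next candidate.
                allFailed.addSuppressed(e);
            }
        }
        throw allFailed;
    }
}
```

Keeping the failures as suppressed exceptions means the final error still shows why each candidate was skipped, which would matter for the metrics and debugging story.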

*Sketch:*

CREATE MODEL support_router
  INPUT (query STRING, lang STRING)
  OUTPUT (response STRING)
  WITH (
    'provider' = 'routing',
    'strategy' = 'rule',
    'candidates' = 'cheap;smart',
    'candidate.cheap.provider' = 'openai',
    'candidate.cheap.model' = '...',
    'candidate.smart.provider' = 'openai',
    'candidate.smart.model' = '...',
    'rule.1.when' = 'CHAR_LENGTH(query) > 200',
    'rule.1.then' = 'smart',
    'default-model' = 'cheap'
  );

SELECT * FROM ML_PREDICT(TABLE tickets, MODEL support_router,
  DESCRIPTOR(query, lang));
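The flat option layout in the sketch ('candidates' plus 'candidate.<name>.*' keys) implies a step that splits the WITH options into one option map per candidate before each candidate's own provider is created. A minimal Java sketch of that step follows. The class name and the validation rules are assumptions for illustration; only the option keys come from the DDL above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the option keys follow the DDL sketch, everything else is assumed.
final class CandidateOptionsSketch {

    /** Splits flat WITH options into per-candidate maps, keyed by candidate name in declared order. */
    static Map<String, Map<String, String>> split(Map<String, String> options) {
        String list = options.get("candidates");
        if (list == null || list.trim().isEmpty()) {
            throw new IllegalArgumentException("'candidates' must list at least one name");
        }
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String raw : list.split(";")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue; // tolerate stray separators such as "cheap;;smart"
            }
            String prefix = "candidate." + name + ".";
            Map<String, String> scoped = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : options.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    // Strip the prefix so the candidate sees plain keys like 'provider'.
                    scoped.put(e.getKey().substring(prefix.length()), e.getValue());
                }
            }
            if (!scoped.containsKey("provider")) {
                throw new IllegalArgumentException("Candidate '" + name + "' has no provider");
            }
            result.put(name, scoped);
        }
        // Assumed validation: 'default-model', if set, must name a declared candidate.
        String fallback = options.get("default-model");
        if (fallback != null && !result.containsKey(fallback)) {
            throw new IllegalArgumentException("'default-model' is not a candidate: " + fallback);
        }
        return result;
    }
}
```

With the options from the DDL above, split(...) would yield two maps, "cheap" and "smart", each containing 'provider' and 'model', which could then be handed to the respective provider factory as if they came from a standalone CREATE MODEL.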

I have a working prototype with all three strategies, sync + async
execution, ordered fallback, metrics, and tests running locally, which I
can share as a reference.

*Questions:*

   - Any concern with modeling routing as a ModelProvider, rather than as a
     planner/DDL-level construct?

   - Does this warrant a FLIP, given the new configuration surface?

   - A cleaner surface would let candidates reference existing CREATE MODELs
     by name. However, ModelProviderFactory.Context cannot resolve another
     model by name today. Would you support a small SPI addition for
     model/catalog resolution, or prefer another mechanism for one model to
     reference another?

Thanks,
Purshotam Shah
