Hi all, FLIP-525/526 bind each ML_PREDICT query to a single model. I'd like to discuss supporting model routing — choosing among several candidate models per row, e.g. cheap vs. strong models, or fallback on failure — and get feedback on the approach.
JIRA: https://issues.apache.org/jira/browse/FLINK-39961

Similar routing/fallback patterns are already common around SQL-based ML/AI products such as BigQuery ML, Databricks, and Snowflake Cortex. The goal here is to support this SQL-native pattern inside Flink's model functions, rather than requiring separate jobs, an external gateway, or UDF glue.

*Approach:*

Implement routing as a ModelProvider via the existing SPI, i.e. 'provider' = 'routing'. It wraps several candidate models, picks one per row with a pluggable strategy, and delegates to that candidate's existing predict runtime. It reuses the ML_PREDICT lookup-join path, so there are no planner or runtime changes.

Built-in strategies would include rule-based routing over input columns, an LLM judge selector, and a classifier/scoring-model selector, plus an ordered fallback chain.

*Sketch:*

    CREATE MODEL support_router
    INPUT (query STRING, lang STRING)
    OUTPUT (response STRING)
    WITH (
      'provider' = 'routing',
      'strategy' = 'rule',
      'candidates' = 'cheap;smart',
      'candidate.cheap.provider' = 'openai',
      'candidate.cheap.model' = '...',
      'candidate.smart.provider' = 'openai',
      'candidate.smart.model' = '...',
      'rule.1.when' = 'CHAR_LENGTH(query) > 200',
      'rule.1.then' = 'smart',
      'default-model' = 'cheap'
    );

    SELECT * FROM ML_PREDICT(TABLE tickets, MODEL support_router, DESCRIPTOR(query, lang));

I have a working prototype with all three strategies, sync and async execution, ordered fallback, metrics, and tests running locally, which I can share as a reference.

*Questions:*

- Any concern with modeling routing as a ModelProvider, rather than as a planner/DDL-level construct?
- Does this warrant a FLIP, given the new configuration surface?
- A cleaner surface would let candidates reference existing CREATE MODELs by name. However, ModelProviderFactory.Context cannot resolve another model by name today. Would you support a small SPI addition for model/catalog resolution, or prefer another mechanism for one model to reference another?

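
For illustration only, the per-row selection and ordered fallback described in the approach could look roughly like the plain-Java sketch below. Every name here (RoutingSketch, Rule, Router, example) is hypothetical and is not taken from the prototype or from Flink's SPI, and the candidate models are stand-in lambdas rather than real providers. It also assumes that fallback means "try the selected candidate first, then the remaining candidates in declaration order", which the proposal does not specify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Illustrative only: none of these types are Flink APIs or prototype classes. */
public class RoutingSketch {

    /** A hypothetical rule: when the predicate matches a row, route to the named candidate. */
    static final class Rule {
        final Predicate<Map<String, String>> when;
        final String then;

        Rule(Predicate<Map<String, String>> when, String then) {
            this.when = when;
            this.then = then;
        }
    }

    /** Selects one candidate per row and falls back through the others on failure. */
    static final class Router {
        // LinkedHashMap keeps declaration order, which this sketch uses as the fallback order.
        private final LinkedHashMap<String, Function<Map<String, String>, String>> candidates;
        private final List<Rule> rules;
        private final String defaultModel;

        Router(LinkedHashMap<String, Function<Map<String, String>, String>> candidates,
               List<Rule> rules, String defaultModel) {
            this.candidates = candidates;
            this.rules = rules;
            this.defaultModel = defaultModel;
        }

        /** First matching rule wins; otherwise the default model is used. */
        String select(Map<String, String> row) {
            for (Rule rule : rules) {
                if (rule.when.test(row)) {
                    return rule.then;
                }
            }
            return defaultModel;
        }

        /** Tries the selected candidate first, then the remaining ones in declaration order. */
        String predict(Map<String, String> row) {
            String selected = select(row);
            List<String> order = new ArrayList<>();
            order.add(selected);
            for (String name : candidates.keySet()) {
                if (!name.equals(selected)) {
                    order.add(name);
                }
            }
            RuntimeException lastFailure = null;
            for (String name : order) {
                try {
                    return candidates.get(name).apply(row);
                } catch (RuntimeException e) {
                    lastFailure = e; // remember the failure and try the next candidate
                }
            }
            throw new IllegalStateException("All candidates failed", lastFailure);
        }
    }

    /** Builds a router mirroring the 'cheap'/'smart' sketch; the models are stand-in lambdas. */
    static Router example(boolean cheapFails) {
        LinkedHashMap<String, Function<Map<String, String>, String>> candidates = new LinkedHashMap<>();
        candidates.put("cheap", row -> {
            if (cheapFails) {
                throw new IllegalStateException("simulated outage");
            }
            return "cheap:" + row.get("query");
        });
        candidates.put("smart", row -> "smart:" + row.get("query"));

        // Approximates 'rule.1.when' = 'CHAR_LENGTH(query) > 200' with String.length().
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule(row -> row.getOrDefault("query", "").length() > 200, "smart"));
        return new Router(candidates, rules, "cheap");
    }

    public static void main(String[] args) {
        Map<String, String> row = Map.of("query", "reset my password", "lang", "en");
        System.out.println(example(false).predict(row)); // routed to the default candidate
        System.out.println(example(true).predict(row));  // default fails, falls back to the next one
    }
}
```

Keeping selection (select) separate from delegation (predict) is one way the strategy could stay pluggable: a judge or classifier selector would only replace select, while the fallback loop stays the same.
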
Thanks,
Purshotam Shah
