Hi Jane,
Thanks for your detailed response.
You mentioned that there are 10k+ SQL jobs in your production
environment, but only ~100 jobs' migration involves plan editing. Is 10k+
the number of total jobs, or the number of jobs that use stateful
computation and need state migration?
10k is the number of SQL jobs that have periodic checkpointing enabled. And
of course, if users change their SQL in a way that changes the plan, they
need to do state migration.
- You mentioned that "A truth that can not be ignored is that users
usually tend to give up editing TTL(or operator ID in our case) instead of
migrating this configuration between their versions of one given job." So
what would users prefer to do if they're reluctant to edit the operator
ID? Would they submit the same SQL as a new job with a higher version to
re-accumulate the state from the earliest offset?
You're exactly right. People tend to re-accumulate the state from a
given offset by changing the namespace of their checkpoint.
Namespace is an internal concept, and restarting the SQL job in a new
namespace can simply be understood as submitting a new job.
Back to your suggestions, I noticed that FLIP-190 [3] proposed the
following syntax to perform plan migration
The 'plan migration' I mentioned in my last reply may have been inaccurate.
It's more like 'query evolution'. In other words, if a user submits a SQL job
with a configured compiled plan and then changes the SQL, the compiled plan
changes too; the question is how to move the configuration from the old plan
to the new plan.
IIUC, FLIP-190 aims to solve issues in Flink version upgrades and leaves out
'query evolution', which is a fundamental change to the query, e.g.
adding a filter condition or using a different aggregation.
I'm really looking forward to a solution for query evolution.
And I'm also curious about how to use the hint
approach to cover cases like
- configuring TTL for operators like ChangelogNormalize,
SinkUpsertMaterializer, etc., these operators are derived by the planner
implicitly
- cope with two/multiple input stream operator's state TTL, like join,
and other operations like row_number, rank, correlate, etc.
Actually, in our company, a hint affects all operators in the query block
where the hint is located. For example,
INSERT INTO sink
SELECT /*+ STATE_TTL('1D') */
  id,
  name,
  num
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) AS row_num
  FROM (
    SELECT
      *
    FROM (
      SELECT
        id,
        name,
        MAX(num) AS num
      FROM source1
      GROUP BY
        id, name, TUMBLE(proc, INTERVAL '1' MINUTE)
    )
    GROUP BY
      id, name, num
  )
)
WHERE row_num = 1
In the SQL above, the state TTL of both Rank and Agg will be configured as 1
day. If users want to set different TTLs for Rank and Agg, they can simply
put these two queries in two different query blocks.
It looks quite rough, but it is straightforward enough. For each side of a
join operator, one of my users proposed a syntax like the one below:
/*+
JOIN_TTL('tables'='left_table,right_table','left_ttl'='100000','right_ttl'='10000')
*/
We haven't accepted this proposal yet; maybe we can find a better
design for this kind of case. Just for your information.
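To make the shape of that proposal a bit more concrete, the hint would presumably sit on the SELECT of the join query, roughly as sketched below. This is only an illustration: JOIN_TTL is not implemented anywhere, the column names (id, name, amount) are made up, and the proposal does not say what unit the TTL values are in.

```sql
-- Sketch only: JOIN_TTL is a user-proposed hint that has not been accepted.
-- The columns id, name and amount are hypothetical; the unit of
-- 'left_ttl' / 'right_ttl' is not specified in the proposal.
SELECT /*+ JOIN_TTL('tables'='left_table,right_table','left_ttl'='100000','right_ttl'='10000') */
  l.id,
  l.name,
  r.amount
FROM left_table AS l
JOIN right_table AS r
  ON l.id = r.id
```

The idea is that each input of the join gets its own state TTL, keyed by the position of the table name in the 'tables' list.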
I think if we want to utilize hints to support fine-grained configuration,
we can open a new FLIP to discuss it.
BTW, personally, I'm interested in how to design a graphical interface to
help users to maintain their custom fine-grained configuration between
their job versions.
Best regards,
Yisha