Hi team

Graceful Application Evolution is a hard, open problem for the
community. We have run into it in our production environment as well. To
address it, we plan to use a backfill-based approach together with a
self-built management system, and we would like to hear the community's
feedback on this approach.

Our job structure looks like this: Kafka Inputs -> Flink SQL v1 -> Kafka
Output. The Kafka Output interface / address must remain unchanged from
the clients' point of view. We perform a code change in three steps:

1. *Develop Step:* The client launches a new Flink job that upgrades the
code from Flink SQL v1 to Flink SQL v2, with the structure Kafka Inputs ->
Flink SQL v2 -> TEMP Kafka. The new job reads the production inputs and
writes into an auto-generated temporary Kafka topic, so the Flink SQL v1
job and its production output are not polluted (see the sketch after this
step).
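
To make the Develop Step concrete, below is a minimal sketch of what the
v2 development job could look like with the Flink Table API. The schema,
topic names, connector options and the SQL query itself are illustrative
assumptions, not our actual production job; the v1 job would look the
same except for the query and for writing to the real production output
topic.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlV2DevelopJob {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Same production Kafka inputs the v1 job reads (schema and topic are illustrative).
        tEnv.executeSql(
                "CREATE TABLE kafka_input (" +
                "  user_id STRING," +
                "  amount  BIGINT" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'input-topic'," +
                "  'properties.bootstrap.servers' = 'kafka:9092'," +
                "  'properties.group.id' = 'flink-sql-v2-dev'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'json'" +
                ")");

        // The sink points at an auto-generated TEMP topic instead of the production
        // output topic, so the running v1 job and its clients see no pollution.
        tEnv.executeSql(
                "CREATE TABLE temp_kafka_output (" +
                "  user_id STRING," +
                "  total   BIGINT," +
                "  PRIMARY KEY (user_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'upsert-kafka'," +
                "  'topic' = 'temp-output-v2-20xx'," +
                "  'properties.bootstrap.servers' = 'kafka:9092'," +
                "  'key.format' = 'json'," +
                "  'value.format' = 'json'" +
                ")");

        // Flink SQL v2 business logic; in the Deploy Step the same query is pointed
        // at the production <Kafka Output> sink instead of the TEMP topic.
        tEnv.executeSql(
                "INSERT INTO temp_kafka_output " +
                "SELECT user_id, SUM(amount) AS total " +
                "FROM kafka_input GROUP BY user_id");
    }
}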

2. *Deploy Step*: Once the client has tested Flink SQL v2 thoroughly and
considers it ready to be promoted to production (completeness is judged
manually by the client), the client can deploy the Flink SQL v2 logic to
production in one click. Behind the scenes, the system automatically
performs the following actions in sequence (see the sketch after this
step):
    2.1. The system takes a savepoint of the Flink SQL v2 job, which
contains all of its internal state.
    2.2. The system stops the Flink SQL v2 job and the Flink SQL v1 job.
    2.3. The system creates a new production-ready job with the structure
Kafka Inputs -> Flink SQL v2 -> Kafka Output, restored from the savepoint
taken in 2.1.
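
For clarity, here is a rough sketch of what the one-click promotion does
behind the scenes. ManagementClient and all of its methods are
hypothetical stand-ins for our self-built management system (they are not
Flink or Kafka APIs), and the job ids, savepoint path and topic names are
made up.

public class DeployStep {

    // Hypothetical facade over the self-built management system.
    interface ManagementClient {
        String triggerSavepoint(String jobId);        // 2.1: returns the savepoint path
        void stopJob(String jobId);                   // 2.2
        String submitFromSavepoint(String sql,        // 2.3: returns the new job id
                                   String savepointPath,
                                   String sinkTopic);
    }

    static String promote(ManagementClient client,
                          String v1JobId, String v2JobId,
                          String v2Sql, String productionTopic) {
        // 2.1: snapshot the full internal state of the v2 (TEMP-topic) job.
        String savepoint = client.triggerSavepoint(v2JobId);
        // 2.2: stop both the v2 job and the old v1 production job.
        client.stopJob(v2JobId);
        client.stopJob(v1JobId);
        // 2.3: restart the v2 logic from the savepoint, now writing to the production output.
        return client.submitFromSavepoint(v2Sql, savepoint, productionTopic);
    }

    public static void main(String[] args) {
        // Dry run with a fake client that only logs what the system would do.
        ManagementClient fake = new ManagementClient() {
            public String triggerSavepoint(String jobId) {
                System.out.println("savepoint " + jobId);
                return "s3://savepoints/" + jobId;
            }
            public void stopJob(String jobId) {
                System.out.println("stop " + jobId);
            }
            public String submitFromSavepoint(String sql, String sp, String topic) {
                System.out.println("submit " + sql + " from " + sp + " -> " + topic);
                return "flink-sql-v2-prod";
            }
        };
        System.out.println(promote(fake, "flink-sql-v1-prod", "flink-sql-v2-dev",
                "INSERT INTO kafka_output SELECT ...", "output-topic"));
    }
}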

3. *Backfill Step*: Once the Deploy Step is done, Flink SQL v2 is already
in production and serving the latest traffic. It is at the client's
discretion when, and how fast, to perform a backfill to correct all the
records.

    3.1. Here we need a special form of backfill: for the Flink job, given
one key in the schema of <Kafka Output>, the backfill will 1) send a
Refresh Record, e.g. UPSERT<key, latest value>, to clients if the key
exists in Flink state, or 2) send a Delete Record, e.g. DELETE<key, null>,
to clients if the key does not exist in Flink state (see the sketch
below).
    3.2. The system will backfill all the records of the two sinks involved
in the Deploy Step, <Kafka Output> and <TEMP Kafka>. The backfill will
either refresh the state of clients' records or clean up clients' stale
records.
    3.3. After the backfill is complete, the <TEMP Kafka> topic will be
destroyed automatically by the system.
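
To illustrate the rule in 3.1, here is a small sketch of the per-key
backfill decision. The Flink state is modeled as a plain in-memory Map and
the record is rendered as a readable string, both purely for illustration.

import java.util.Map;

public class BackfillRule {

    // For one key in the schema of <Kafka Output>, decide which record to send to clients.
    static String backfillOne(String key, Map<String, String> flinkState) {
        String latest = flinkState.get(key);
        if (latest != null) {
            // Key still exists in Flink state: refresh the client with the latest value.
            return "UPSERT<" + key + ", " + latest + ">";
        }
        // Key no longer exists in Flink state: tell the client to delete its stale record.
        return "DELETE<" + key + ", null>";
    }

    public static void main(String[] args) {
        Map<String, String> state = Map.of("user-1", "42");
        System.out.println(backfillOne("user-1", state)); // UPSERT<user-1, 42>
        System.out.println(backfillOne("user-2", state)); // DELETE<user-2, null>
    }
}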

Please let us know your opinions on this approach.

Regards
Zhihao
