Could you explain why you need a backfill after you take v2 into production?
Best,
Kurt

On Fri, Sep 3, 2021 at 2:02 AM zhihao wang <[email protected]> wrote:

> Hi team,
>
> Graceful Application Evolution is a hard and open problem for the
> community. We have met this problem in our production as well. To address
> it, we are planning to use a backfill-based approach together with a
> self-built management system. We would like to hear the community's
> feedback on this approach.
>
> Our job structure looks like this: Kafka Inputs -> Flink SQL v1 -> Kafka
> Output. The Kafka Output interface / address must stay unchanged for
> clients. We perform a code change in three steps:
>
> 1. *Develop Step:* The client launches a new Flink job to update the code
> from Flink SQL v1 to Flink SQL v2, with the structure: Kafka Inputs ->
> Flink SQL v2 -> TEMP Kafka. The new job reads the production inputs and
> writes into an auto-generated temporary Kafka topic, so the Flink SQL v1
> job is not polluted.
>
> 2. *Deploy Step:* When the client has tested thoroughly and considers
> Flink SQL v2 ready to be promoted to production (completeness is judged
> manually by the client), the client can deploy the Flink SQL v2 logic to
> production in one click. Behind the scenes, the system automatically
> performs the following actions in sequence:
> 2.1. Take a savepoint of the Flink SQL v2 job, which contains all of its
> internal state.
> 2.2. Stop the Flink SQL v2 job and the Flink SQL v1 job.
> 2.3. Create a new production-ready job with the structure Kafka Inputs ->
> Flink SQL v2 -> Kafka Output, restored from the savepoint taken in 2.1.
>
> 3. *Backfill Step:* After the Deploy Step is done, Flink SQL v2 is already
> in production and serves the latest traffic. It is at the client's
> discretion when and how fast to perform a backfill to correct all the
> records.
> 3.1. We need a special form of backfill here: for the Flink job, given one
> key in the schema of <Kafka Output>, the backfill will 1) send a Refresh
> Record, e.g. UPSERT <key, latest value>, to clients if the key exists in
> Flink state, or 2) send a Delete Record, e.g. DELETE <key, null>, to
> clients if the key does not exist in Flink state.
> 3.2. The system backfills all the records of the two sinks from the Deploy
> Step, <Kafka Output> and <TEMP Kafka>. The backfill either refreshes the
> clients' record state or cleans up the clients' stale records.
> 3.3. After the backfill is completed, the <TEMP Kafka> topic is destroyed
> automatically by the system.
>
> Please let us know your opinions on this approach.
>
> Regards,
> Zhihao
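
For reference, a minimal sketch of how I read the job structure and the
Develop Step, written as Flink SQL. All table names, topics, schemas and the
v1/v2 query logic below are hypothetical placeholders; I am assuming an
upsert-kafka sink for <Kafka Output> only because the UPSERT/DELETE semantics
described in 3.1 match that connector's changelog model.

    -- Production input (placeholder schema and topic)
    CREATE TABLE kafka_input (
      user_id  BIGINT,
      amount   DOUBLE,
      event_ts TIMESTAMP(3)
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'input-topic',
      'properties.bootstrap.servers' = 'kafka:9092',
      'scan.startup.mode' = 'earliest-offset',
      'format' = 'json'
    );

    -- Production sink; clients keep reading this topic across v1 -> v2
    CREATE TABLE kafka_output (
      user_id BIGINT,
      total   DOUBLE,
      PRIMARY KEY (user_id) NOT ENFORCED
    ) WITH (
      'connector' = 'upsert-kafka',
      'topic' = 'output-topic',
      'properties.bootstrap.servers' = 'kafka:9092',
      'key.format' = 'json',
      'value.format' = 'json'
    );

    -- Flink SQL v1 (production): Kafka Inputs -> Flink SQL v1 -> Kafka Output
    INSERT INTO kafka_output
    SELECT user_id, SUM(amount) AS total
    FROM kafka_input
    GROUP BY user_id;

    -- Develop Step: Flink SQL v2 reads the same production inputs but writes
    -- to an auto-generated temporary topic, leaving the v1 output untouched
    CREATE TABLE temp_kafka (
      user_id BIGINT,
      total   DOUBLE,
      PRIMARY KEY (user_id) NOT ENFORCED
    ) WITH (
      'connector' = 'upsert-kafka',
      'topic' = 'temp-output-topic-v2',
      'properties.bootstrap.servers' = 'kafka:9092',
      'key.format' = 'json',
      'value.format' = 'json'
    );

    INSERT INTO temp_kafka
    SELECT user_id, SUM(amount) AS total   -- v2 logic change is a placeholder
    FROM kafka_input
    WHERE amount > 0
    GROUP BY user_id;

With an upsert-kafka sink, the Refresh Record from 3.1 is an ordinary upsert
(key plus latest value) and the Delete Record is a tombstone (key with a null
value), which seems to be exactly the contract the backfill is describing.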

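The Deploy Step (2.1 - 2.3) also maps fairly directly onto the standard Flink
CLI savepoint / stop / resume operations. A rough sketch, assuming the job ids,
the savepoint directory and the packaged v2 production job are placeholders
(the management system would presumably drive the equivalent REST calls rather
than the CLI):

    # 2.1 + 2.2 for v2: take a savepoint of the Flink SQL v2 job and stop it
    bin/flink stop --savepointPath s3://bucket/savepoints/ <flink-sql-v2-job-id>

    # 2.2: stop the Flink SQL v1 job (its state is not reused)
    bin/flink cancel <flink-sql-v1-job-id>

    # 2.3: start the production v2 job (sink switched from TEMP Kafka to
    #      Kafka Output), restoring from the savepoint taken above
    bin/flink run -s s3://bucket/savepoints/<savepoint-dir> <v2-production-job.jar>

One thing step 2.3 seems to rely on is that the savepoint taken against the
TEMP Kafka sink remains restorable once the sink is swapped to the production
output, i.e. that the generated operator ids stay stable across the two job
graphs; I assume that is something the management system guarantees.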