+ team

On Fri, Sep 3, 2021 at 9:37 AM zhihao wang <[email protected]> wrote:
> We backfill from two sources for two different reasons.
>
> 1. We write the outputs into a TEMP Kafka topic first to allow the v2 job
> to cold-start. The events accumulated in TEMP Kafka then need to be
> redirected to the final sink. A simple way would be to move the events
> from TEMP Kafka into the final sink as-is, but that carries over stale
> events. We rely on the backfill to produce the latest state for each event
> when we move events from TEMP Kafka to the final sink.
>
> 2. The final sink already contains v1-generated output, which needs to be
> cleaned up after we switch to the v2 job. For example, the v2 logic might
> be "where field > 0" while the v1 logic was "where field >= 0". If we
> don't clean up the "field = 0" events, they will stay in the final sink
> forever, which is wrong. We rely on the backfill to clean up such
> to-be-deleted records.
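For illustration, a minimal Flink SQL sketch of the reason-2 scenario above.
The table names, key column, topics, and broker address are hypothetical
placeholders, not part of the original proposal.

-- Hypothetical Kafka input (all connector options are placeholders).
CREATE TABLE input (
  k STRING,
  field INT
) WITH (
  'connector' = 'kafka',
  'topic' = 'input',
  'properties.bootstrap.servers' = 'broker:9092',
  'properties.group.id' = 'evolvement-demo',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Hypothetical final sink keyed by k. With the upsert-kafka connector an
-- UPSERT row is written as <k, latest value> and a DELETE row as a
-- tombstone <k, null>.
CREATE TABLE final_sink (
  k STRING,
  field INT,
  PRIMARY KEY (k) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'output',
  'properties.bootstrap.servers' = 'broker:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);

-- v1 logic: emitted rows with field = 0, which now sit in the final sink.
-- INSERT INTO final_sink SELECT k, field FROM input WHERE field >= 0;

-- v2 logic: never produces field = 0 rows, but it also never retracts the
-- ones v1 already wrote; only a backfill that sends DELETE records for
-- those keys can remove them from the sink.
INSERT INTO final_sink SELECT k, field FROM input WHERE field > 0;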
> On Fri, Sep 3, 2021 at 9:15 AM Kurt Young <[email protected]> wrote:
>
>> Could you explain why you need a backfill after you take v2 into
>> production?
>>
>> Best,
>> Kurt
>>
>> On Fri, Sep 3, 2021 at 2:02 AM zhihao wang <[email protected]> wrote:
>>
>> > Hi team
>> >
>> > Graceful Application Evolvement is a hard and open problem for the
>> > community. We have met this problem in our production environment, too.
>> > To address it, we plan to use a backfill-based approach with a
>> > self-built management system. We would like to hear the community's
>> > feedback on this approach.
>> >
>> > Our job structure looks like this: Kafka Inputs -> Flink SQL v1 ->
>> > Kafka Output. We need to keep the Kafka Output interface/address
>> > unchanged for clients. We perform a code change in three steps:
>> >
>> > 1. *Develop Step*: The client launches a new Flink job that updates the
>> > code from Flink SQL v1 to Flink SQL v2, with the structure Kafka Inputs
>> > -> Flink SQL v2 -> TEMP Kafka. The new job reads the production inputs
>> > and writes into an auto-generated temporary Kafka topic, so there is no
>> > pollution of the Flink SQL v1 job's output (a sketch of this setup
>> > follows the quoted thread).
>> >
>> > 2. *Deploy Step*: When the client has tested thoroughly and considers
>> > Flink SQL v2 ready to be promoted to production (completeness is judged
>> > manually by the client), the client can deploy the Flink SQL v2 logic
>> > to production in one click. Behind the scenes, the system automatically
>> > performs the following actions in sequence:
>> >     2.1. The system takes a savepoint of the Flink SQL v2 job, which
>> > contains all of its internal state.
>> >     2.2. The system stops the Flink SQL v2 job and the Flink SQL v1
>> > job.
>> >     2.3. The system creates a new production-ready job with the
>> > structure Kafka Inputs -> Flink SQL v2 -> Kafka Output from the
>> > savepoint taken in 2.1 (also sketched after the quoted thread).
>> >
>> > 3. *Backfill Step*: After the Deploy Step is done, Flink SQL v2 is
>> > already in production and serves the latest traffic. It is at the
>> > client's discretion when and how fast to perform a backfill to correct
>> > all the records.
>> >
>> >     3.1. Here we need a special form of backfill: for the Flink job,
>> > given one key in the schema of <Kafka Output>, the backfill will
>> > 1) send a Refresh Record, e.g. UPSERT <key, latest value>, to clients
>> > if the key exists in Flink state, or 2) send a Delete Record, e.g.
>> > DELETE <key, null>, to clients if the key does not exist in Flink
>> > state.
>> >     3.2. The system will backfill all the records of the two sinks from
>> > the Deploy Step, <Kafka Output> and <TEMP Kafka>. The backfill will
>> > either refresh clients' record states or clean up clients' stale
>> > records.
>> >     3.3. After the backfill is completed, <TEMP Kafka> will be
>> > destroyed automatically by the system.
>> >
>> > Please let us know your opinions on this approach.
>> >
>> > Regards
>> > Zhihao
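To make the Develop Step in the quoted proposal concrete: the same v2 query
can be pointed at an auto-generated temporary topic so the production output
stays untouched. A sketch, reusing the hypothetical names from the earlier
sketch:

-- Develop Step: same schema as the final sink, but an auto-generated TEMP
-- topic (the name below is a placeholder), so v1's output is not polluted.
CREATE TABLE temp_sink (
  k STRING,
  field INT,
  PRIMARY KEY (k) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'output-v2-tmp-20210903',
  'properties.bootstrap.servers' = 'broker:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);

-- The v2 job under test reads the real production inputs but writes only
-- to the TEMP topic.
INSERT INTO temp_sink SELECT k, field FROM input WHERE field > 0;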

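Finally, one possible shape of the Deploy Step cut-over (2.1-2.3), again
reusing the hypothetical names above and assuming a recent Flink SQL client
plus a savepoint path produced in step 2.1:

-- Steps 2.1/2.2 (savepoint, then stopping the v2 test job and the v1
-- production job) are driven by the management system. For step 2.3, the
-- production query is resubmitted from that savepoint, now writing to the
-- real sink.
SET 'execution.savepoint.path' = '/savepoints/v2-cutover';  -- placeholder path from 2.1

INSERT INTO final_sink SELECT k, field FROM input WHERE field > 0;

Whether the savepoint state maps cleanly onto the new plan once the sink is
swapped from the TEMP topic to the production topic is exactly the kind of
guarantee the proposed management system would need to provide.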