Could you explain why you need a backfill after you take v2 into
production?

Best,
Kurt


On Fri, Sep 3, 2021 at 2:02 AM zhihao wang <[email protected]> wrote:

> Hi team
>
> Graceful Application Evolution is a hard and open problem for the
> community. We have run into this problem in our production environment as
> well. To address it, we are planning to use a backfill-based approach
> together with a self-built management system, and we would like to hear
> the community's feedback on this approach.
>
> Our job structure is: Kafka Inputs -> Flink SQL v1 -> Kafka Output. We
> need to keep the Kafka Output interface / address unchanged for clients.
> We perform a code change in three steps:
>
> 1. *Develop Step:* The client launches a new Flink job to update the code
> from Flink SQL v1 to Flink SQL v2, with structure: Kafka Inputs -> Flink
> SQL v2 -> TEMP Kafka. The new job reads the production inputs and writes
> into an auto-generated temporary Kafka topic, so it does not interfere
> with the output of the Flink SQL v1 job (a sketch follows).
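>
>     A minimal sketch of what such a development job could look like in
> Flink SQL. The table names, topic names, and the aggregation query are
> hypothetical placeholders, not our actual pipeline:
>
>     -- Assumed source table over the production inputs.
>     CREATE TABLE kafka_input_orders (
>       user_id STRING,
>       amount  DECIMAL(10, 2)
>     ) WITH (
>       'connector' = 'kafka',
>       'topic' = 'orders',
>       'properties.bootstrap.servers' = 'kafka:9092',
>       'scan.startup.mode' = 'earliest-offset',
>       'format' = 'json'
>     );
>
>     -- Auto-generated TEMP Kafka sink, so the v1 output stays untouched.
>     CREATE TABLE temp_kafka_output (
>       user_id STRING,
>       total_amount DECIMAL(10, 2),
>       PRIMARY KEY (user_id) NOT ENFORCED
>     ) WITH (
>       'connector' = 'upsert-kafka',
>       'topic' = 'temp-output-v2',
>       'properties.bootstrap.servers' = 'kafka:9092',
>       'key.format' = 'json',
>       'value.format' = 'json'
>     );
>
>     -- Flink SQL v2 logic, reading the same production inputs as v1.
>     INSERT INTO temp_kafka_output
>     SELECT user_id, SUM(amount) AS total_amount
>     FROM kafka_input_orders
>     GROUP BY user_id;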
>
> 2. *Deploy Step*: When the client has tested thoroughly and considers
> Flink SQL v2 ready to be promoted to production (completeness is judged
> manually by the client), the client can deploy the Flink SQL v2 logic to
> production in one click. Behind the scenes, the system will automatically
> perform the following actions in sequence (a rough sketch follows the
> list):
>     2.1. The system takes a savepoint of the Flink SQL v2 job, which
> contains all of its internal state.
>     2.2. The system stops both the Flink SQL v2 job and the Flink SQL v1
> job.
>     2.3. The system creates a new production-ready job with structure
> Kafka Inputs -> Flink SQL v2 -> Kafka Output, restored from the savepoint
> taken in 2.1.
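>
>     A rough sketch of 2.1-2.3, assuming the job is operated through the
> Flink CLI and the SQL Client. The savepoint path, table names, and query
> are placeholders:
>
>     -- 2.1 / 2.2: take a savepoint of the v2 job and stop both jobs,
>     -- e.g. via `flink stop --savepointPath <dir> <jobId>` on the CLI.
>
>     -- 2.3: in a new session, restore from that savepoint and point the
>     -- v2 query at the real production sink instead of TEMP Kafka.
>     SET 'execution.savepoint.path' = '/savepoints/savepoint-v2-xxxx';
>
>     -- kafka_output is the production <Kafka Output> table (a hypothetical
>     -- definition is sketched under the Backfill Step below).
>     INSERT INTO kafka_output
>     SELECT user_id, SUM(amount) AS total_amount
>     FROM kafka_input_orders
>     GROUP BY user_id;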
>
> 3. *Backfill Step*: After the Deploy Step is done, Flink SQL v2 is already
> in production and serves the latest traffic. It is at the client’s
> discretion when and how fast to perform a backfill to correct all the
> records.
>
>     3.1. Here we need a special form of backfill: for the Flink job, given
> a key in the schema of <Kafka Output>, the backfill will 1) send a Refresh
> Record, e.g. UPSERT <key, latest value>, to clients if the key exists in
> the Flink state, or 2) send a Delete Record, e.g. DELETE <key, null>, to
> clients if the key does not exist in the Flink state (see the sketch after
> this list).
>     3.2. The system will backfill all the records of the two sinks from
> the Deploy Step, <Kafka Output> and <TEMP Kafka>. The backfill either
> refreshes the clients’ record states or cleans up the clients’ stale
> records.
>     3.3. After the backfill is completed, the <TEMP Kafka> topic is
> destroyed automatically by the system.
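>
>     For reference, the Refresh / Delete record shapes in 3.1 match how an
> upsert-kafka sink encodes its changelog: an UPSERT row is written as
> <key, latest value> and a DELETE row as a <key, null> tombstone. A
> hypothetical definition of the production <Kafka Output> table along those
> lines is below (the schema and topic name are placeholders; how the
> backfill is actually driven against Flink state is the job of our
> management system, not of this DDL):
>
>     CREATE TABLE kafka_output (
>       user_id STRING,
>       total_amount DECIMAL(10, 2),
>       PRIMARY KEY (user_id) NOT ENFORCED
>     ) WITH (
>       'connector' = 'upsert-kafka',
>       'topic' = 'production-output',
>       'properties.bootstrap.servers' = 'kafka:9092',
>       'key.format' = 'json',
>       'value.format' = 'json'
>     );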
>
> Please let us know your opinions on this approach.
>
> Regards
> Zhihao
>
