We backfill from two sources for two different reasons.

1. We write the outputs into a TEMP Kafka topic first so that the v2 job can
cold-start. The events accumulated in TEMP Kafka eventually need to be
redirected to the final sink. A simple approach is to just copy the events
from TEMP Kafka into the final sink, but that would also replay stale events.
Instead, we rely on the backfill to emit the latest state for each key when we
move events from TEMP Kafka to the final sink (see the first sketch below).

2. The final sink already contains output generated by v1, which needs to be
cleaned up after we switch to the v2 job. For example, the v2 logic might be
"where field > 0" while the v1 logic was "where field >= 0". If we don't clean
up the field = 0 records, they will stay in the final sink forever, which is
wrong. We rely on the backfill to clean up such to-be-deleted records (see the
second sketch below).
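
To make point 1 concrete, here is a minimal, Flink-agnostic sketch of the
refresh. The Event type, the topic contents, and the latestV2State map are
made up for illustration and are not our actual code; the idea is that rather
than copying TEMP Kafka verbatim, every key seen there is re-emitted with the
value the v2 job currently holds for it:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TempKafkaRefreshSketch {

    // Illustrative shape of one keyed event accumulated in TEMP Kafka.
    record Event(String key, String value) {}

    // Copying TEMP Kafka verbatim would replay stale values. Instead, for each
    // key seen in TEMP Kafka we emit one UPSERT carrying the v2 job's latest
    // state for that key. Keys no longer present in v2 state fall under the
    // Delete Record case shown in the second sketch.
    static Map<String, String> refresh(List<Event> tempKafkaEvents,
                                       Map<String, String> latestV2State) {
        Map<String, String> upserts = new LinkedHashMap<>();
        for (Event e : tempKafkaEvents) {
            String latest = latestV2State.get(e.key());
            if (latest != null) {
                // Refresh Record: UPSERT <key, latest value>
                upserts.put(e.key(), latest);
            }
        }
        return upserts;
    }

    public static void main(String[] args) {
        // "k1" was first emitted with a stale value 10, later updated to 42.
        List<Event> tempKafka = List.of(new Event("k1", "10"), new Event("k1", "42"));
        Map<String, String> v2State = Map.of("k1", "42");
        System.out.println(refresh(tempKafka, v2State)); // prints {k1=42}
    }
}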
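
And a corresponding sketch for point 2, again with made-up keys, values, and
variable names: the backfill walks the keys already sitting in the final sink
and emits a Delete Record for every key the v2 job no longer holds in its
state (the field = 0 record in the example above), and a Refresh Record for
the rest:

import java.util.Map;
import java.util.Set;

public class StaleRecordCleanupSketch {

    public static void main(String[] args) {
        // Keys currently in the final sink, written by the v1 job. "k0" came
        // from a field = 0 event that v1's "where field >= 0" logic kept.
        Set<String> keysInFinalSink = Set.of("k0", "k1");

        // State held by the v2 job. Its "where field > 0" logic never
        // produces "k0", so that key is absent here.
        Map<String, String> v2State = Map.of("k1", "7");

        for (String key : keysInFinalSink) {
            String latest = v2State.get(key);
            if (latest != null) {
                // Key still exists under v2 logic: refresh the client's copy.
                System.out.println("UPSERT <" + key + ", " + latest + ">");
            } else {
                // Key was only ever valid under v1 logic: clean it up so it
                // does not sit in the final sink forever.
                System.out.println("DELETE <" + key + ", null>");
            }
        }
    }
}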

On Fri, Sep 3, 2021 at 9:15 AM Kurt Young <[email protected]> wrote:

>  Could you explain why you need a backfill after you take v2 into
> production?
>
> Best,
> Kurt
>
>
> On Fri, Sep 3, 2021 at 2:02 AM zhihao wang <[email protected]> wrote:
>
> > Hi team
> >
> > Graceful Application Evolvement is a hard and open problem for the
> > community. We have met this problem in our production, too. To address it,
> > we are planning to leverage a backfill-based approach with a self-built
> > management system. We'd like to hear the community's feedback on this
> > approach.
> >
> > Our job structure looks like this: Kafka Inputs -> Flink SQL v1 -> Kafka
> > Output. We need to keep the Kafka Output interface / address unchanged for
> > clients. We perform a code change in three steps:
> >
> > 1. *Develop Step:* The client launches a new Flink job to update the code
> > from Flink SQL v1 to Flink SQL v2, with the structure: Kafka Inputs ->
> > Flink SQL v2 -> TEMP Kafka. The new job reads the production inputs and
> > writes into an auto-generated temporary Kafka topic, so there is no
> > pollution of the Flink SQL v1 job.
> >
> > 2. *Deploy Step*: When the client has tested thoroughly and thinks Flink
> > SQL v2 is ready to be promoted to production (the completeness is judged
> > manually by the client), the client can deploy the Flink SQL v2 logic to
> > production in one click. Behind the scenes, the system will automatically
> > do the following actions in sequence:
> >     2.1. The system will take a savepoint of the Flink SQL v2 job, which
> > contains all of its internal state.
> >     2.2. The system will stop the Flink SQL v2 job and the Flink SQL v1 job.
> >     2.3. The system will create a new production-ready job with the
> > structure Kafka Inputs -> Flink SQL v2 -> Kafka Output from the savepoint
> > taken in 2.1.
> >
> > 3. *Backfill Step*: After the Deploy Step is done, Flink SQL v2 is already
> > in production and serves the latest traffic. It's at the client's
> > discretion when and how fast to perform a backfill to correct all the
> > records.
> >
> >     3.1. Here we need a special form of backfill: for the Flink job, given
> > one key in the schema of <Kafka Output>, the backfill will 1) send a
> > Refresh Record, e.g. UPSERT <key, latest value>, to clients if the key
> > exists in Flink state, or 2) send a Delete Record, e.g. DELETE <key, null>,
> > to clients if the key doesn't exist in Flink state.
> >     3.2. The system will backfill all the records of the two sinks from the
> > Deploy Step, <Kafka Output> and <TEMP Kafka>. The backfill will either
> > refresh clients' records or clean up clients' stale records.
> >     3.3. After the backfill is completed, the <TEMP Kafka> will be
> > destroyed automatically by the system.
> >
> > Please let us know your opinions on this approach.
> >
> > Regards
> > Zhihao
> >
>
