+ team

On Fri, Sep 3, 2021 at 9:37 AM zhihao wang <[email protected]> wrote:

> We backfill from two sources for two different reasons.
>
> 1. We write the outputs into a TEMP Kafka first to allow the v2 job to
> cold-start. The events accumulated in the TEMP Kafka then need to be
> redirected to the final sink. A simple way is to just copy the events from
> the TEMP Kafka into the final sink, but that would carry over stale events.
> We rely on the backfill to get the latest state for each record when we
> move events from the TEMP Kafka to the final sink.
>
> 2. The final sink already contains output generated by v1, which needs to
> be cleaned up after we switch to the v2 job. For example, the v2 logic
> might be "where field > 0" while the v1 logic was "where field >= 0". If
> we don't clean up the field = 0 events, they will stay in the final sink
> forever, which is wrong. We rely on the backfill to clean up such
> to-be-deleted records.
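>
> As a minimal Flink SQL sketch of reason 2 (the table and field names are
> made up for illustration), the two versions could differ like this:
>
>     -- v1 job: also emits the field = 0 rows
>     INSERT INTO final_sink
>     SELECT `key`, `field` FROM kafka_inputs WHERE `field` >= 0;
>
>     -- v2 job: no longer emits field = 0 rows, so the copies that v1
>     -- already wrote stay in the final sink until a backfill deletes them
>     INSERT INTO final_sink
>     SELECT `key`, `field` FROM kafka_inputs WHERE `field` > 0;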
>
> On Fri, Sep 3, 2021 at 9:15 AM Kurt Young <[email protected]> wrote:
>
>>  Could you explain why you need a backfill after you take v2 into
>> production?
>>
>> Best,
>> Kurt
>>
>>
>> On Fri, Sep 3, 2021 at 2:02 AM zhihao wang <[email protected]> wrote:
>>
>> > Hi team
>> >
>> > Graceful Application Evolvement is a hard and open problem for the
>> > community. We have met this problem in our production environment, too.
>> > To address it, we are planning to use a backfill-based approach together
>> > with a self-built management system. We'd like to hear the community's
>> > feedback on this approach.
>> >
>> > Our job structure is: Kafka Inputs -> Flink SQL v1 -> Kafka Output. We
>> > need to keep the Kafka Output interface / address unchanged for clients.
>> > We perform a code change in three steps:
>> >
>> > 1. *Develop Step:* The client launches a new Flink job to update the
>> > code from Flink SQL v1 to Flink SQL v2, with the structure: Kafka Inputs
>> > -> Flink SQL v2 -> TEMP Kafka. The new job reads the production inputs
>> > and writes into an auto-generated temporary Kafka topic, so that it does
>> > not pollute the output of the Flink SQL v1 job.
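>> >
>> > As a rough Flink SQL sketch of the Develop Step (the topic name, schema,
>> > and connector options below are made up for illustration, and the
>> > kafka_inputs table is assumed to already exist):
>> >
>> >     -- auto-generated temporary sink, isolated from the production output
>> >     CREATE TABLE temp_kafka (
>> >       `key`   STRING,
>> >       `field` BIGINT
>> >     ) WITH (
>> >       'connector' = 'kafka',
>> >       'topic' = 'temp-output-v2-candidate',
>> >       'properties.bootstrap.servers' = '<brokers>',
>> >       'format' = 'json'
>> >     );
>> >
>> >     -- the v2 job reads the same production inputs, writes to TEMP Kafka
>> >     INSERT INTO temp_kafka
>> >     SELECT `key`, `field` FROM kafka_inputs WHERE `field` > 0;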
>> >
>> > 2. *Deploy Step*: When the client has tested thoroughly and thinks Flink
>> > SQL v2 is ready to be promoted to production (completeness is judged
>> > manually by the client), the client can deploy the Flink SQL v2 logic to
>> > production in one click. Behind the scenes, the system will automatically
>> > perform the following actions in sequence:
>> >     2.1. The system takes a savepoint of the Flink SQL v2 job, which
>> > contains all of its internal state.
>> >     2.2. The system stops the Flink SQL v2 job and the Flink SQL v1 job.
>> >     2.3. The system creates a new production-ready job with the structure
>> > Kafka Inputs -> Flink SQL v2 -> Kafka Output, restored from the savepoint
>> > taken in 2.1.
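>> >
>> > For 2.3, one way to express the restore, sketched with the SQL client
>> > (the savepoint path and table names are placeholders; this assumes the
>> > optimized plan of v2 is unchanged apart from the sink, otherwise the
>> > unmatched sink state has to be explicitly allowed to be dropped, e.g.
>> > via --allowNonRestoredState when submitting):
>> >
>> >     -- start the production job from the savepoint taken in 2.1
>> >     SET 'execution.savepoint.path' = '<savepoint-path-from-2.1>';
>> >
>> >     -- same v2 query, now writing to the real production output
>> >     INSERT INTO kafka_output
>> >     SELECT `key`, `field` FROM kafka_inputs WHERE `field` > 0;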
>> >
>> > 3. *Backfill Step*: After the Deploy Step is done, Flink SQL v2 is
>> > already in production and serves the latest traffic. It is at the
>> > client's discretion when and how fast to perform a backfill to correct
>> > all the records.
>> >
>> >     3.1. Here we need a special form of backfill: for the Flink job,
>> > given one key in the schema of <Kafka Output>, the backfill will 1) send
>> > a Refresh Record, e.g. UPSERT <key, latest value>, to clients if the key
>> > exists in the Flink state, or 2) send a Delete Record, e.g. DELETE <key,
>> > null>, to clients if the key does not exist in the Flink state (a sketch
>> > of these record semantics follows after 3.3).
>> >     3.2. The system will backfill all the records of the two sinks from
>> > the Deploy Step, <Kafka Output> and <TEMP Kafka>. The backfill will
>> > either refresh the state of clients' records or clean up clients' stale
>> > records.
>> >     3.3. After the backfill is completed, the <TEMP Kafka> will be
>> > destroyed automatically by the system.
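>> >
>> > A sketch of how 3.1's record semantics could be expressed if the final
>> > sink were declared as an upsert table (the names and options below are
>> > illustrative, not our actual setup):
>> >
>> >     CREATE TABLE kafka_output (
>> >       `key`   STRING,
>> >       `field` BIGINT,
>> >       PRIMARY KEY (`key`) NOT ENFORCED
>> >     ) WITH (
>> >       'connector' = 'upsert-kafka',
>> >       'topic' = 'production-output',
>> >       'properties.bootstrap.servers' = '<brokers>',
>> >       'key.format' = 'json',
>> >       'value.format' = 'json'
>> >     );
>> >
>> > With this connector, a Refresh Record (UPSERT <key, latest value>) is a
>> > normal Kafka record for the key, and a Delete Record (DELETE <key, null>)
>> > is written as a tombstone, i.e. the key with a null value.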
>> >
>> > Please let us know your opinions on this approach.
>> >
>> > Regards
>> > Zhihao
>> >
>>
>
