Anish Mahto created SPARK-57222:
-----------------------------------

             Summary: Implement SCD2 Batch Processor; Decompose affected rows
                 Key: SPARK-57222
                 URL: https://issues.apache.org/jira/browse/SPARK-57222
             Project: Spark
          Issue Type: Sub-task
          Components: Declarative Pipelines
    Affects Versions: 4.3.0
            Reporter: Anish Mahto


*Preamble*

The SCD type 2 flow is a foreachBatch streaming query on an input 
change-data-feed, and is responsible for reconciling the incoming change data 
onto some target table that follows SCD2 replication semantics.

SCD2 flows also maintain an "auxiliary" table that tracks the state of events 
that arrived early or out of order. Each microbatch must reconcile against this 
auxiliary table as well, and update its state appropriately for future 
microbatches.

 

*Decompose affected rows*

Given the set of affected rows in the current microbatch execution - incoming 
rows in the microbatch, affected rows from the aux table, and affected rows from 
the target table - the first step in microbatch reconciliation is decomposing 
the closed historical rows that are bisected by the microbatch.

A closed historical row is a row in the target table that has a non-null 
start-at and end-at. An incoming upsert/delete in the microbatch may land with 
a sequence between an existing closed row's start-at and end-at (i.e., it is a 
late-arriving event), bisecting that row.
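
As a rough illustration of the bisection condition above, here is a minimal Python sketch. The names ClosedRow, is_bisected and bisected_rows are hypothetical and are not part of Spark or of this sub-task. The sketch assumes integer sequences and exclusive bounds (an event landing exactly on start-at or end-at does not bisect the row); the actual implementation may treat boundaries differently.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ClosedRow:
    # Hypothetical shape of a closed historical row in the target table.
    key: str
    start_at: int  # sequence at which the row became current
    end_at: int    # sequence at which the row was closed


def is_bisected(row: ClosedRow, key: str, sequence: int) -> bool:
    # Assumption: bounds are exclusive, so only a sequence strictly
    # inside (start_at, end_at) for the same key bisects the row.
    return row.key == key and row.start_at < sequence < row.end_at


def bisected_rows(
    rows: Iterable[ClosedRow], events: Iterable[Tuple[str, int]]
) -> List[ClosedRow]:
    # events holds the (key, sequence) pairs of the incoming microbatch.
    events = list(events)
    return [r for r in rows if any(is_bisected(r, k, s) for k, s in events)]
```

For example, with closed rows for keys "a" and "b" that both span sequences 10 to 20, an incoming event for key "a" at sequence 15 selects only the "a" row.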

Decomposing a closed row means exactly this: splitting the closed interval 
into its left and right endpoints, called the decomposed head and tail of the 
original closed row, respectively. The head represents some past upsert event; 
the tail represents some past delete event.
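
The decomposition step can be sketched in plain Python as follows. This is an illustrative sketch only: ClosedRow, Event and decompose are hypothetical names, and the sketch assumes the head carries the row's payload while the tail carries none.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ClosedRow:
    # Hypothetical closed historical row: non-null start_at and end_at.
    key: str
    payload: Dict[str, str]
    start_at: int
    end_at: int


@dataclass(frozen=True)
class Event:
    # Hypothetical endpoint/event shape used during reconciliation.
    key: str
    kind: str  # "upsert" or "delete"
    sequence: int
    payload: Optional[Dict[str, str]]


def decompose(row: ClosedRow) -> Tuple[Event, Event]:
    # Head: the past upsert that opened the row, at start_at.
    head = Event(row.key, "upsert", row.start_at, row.payload)
    # Tail: the past delete that closed the row, at end_at.
    # Assumption: a delete endpoint carries no payload.
    tail = Event(row.key, "delete", row.end_at, None)
    return head, tail
```

Calling decompose on a row for key "a" spanning sequences 10 to 20 yields an upsert endpoint at 10 and a delete endpoint at 20, which can then be considered alongside the other affected events.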

Once a closed row is decomposed into its endpoints, each endpoint can either 
coalesce with other endpoints/events from the full set of affected rows to form 
a new historical row, or be demoted back to the aux table as a tombstone or 
no-op upsert.



