Anish Mahto created SPARK-57222:
-----------------------------------
Summary: Implement SCD2 Batch Processor; Decompose affected rows
Key: SPARK-57222
URL: https://issues.apache.org/jira/browse/SPARK-57222
Project: Spark
Issue Type: Sub-task
Components: Declarative Pipelines
Affects Versions: 4.3.0
Reporter: Anish Mahto
*Preamble*:
The SCD type 2 flow is a foreachBatch streaming query on an input
change-data-feed, and is responsible for reconciling the incoming change data
onto some target table that follows SCD2 replication semantics.
SCD2 flows also maintain an "auxiliary" table that tracks the state of
early-arriving (out-of-order) events. Each microbatch must reconcile against
this auxiliary table as well, and update the auxiliary table's state
appropriately for future microbatches.
*Decompose affected rows*
Given the set of affected rows in the current microbatch execution (incoming
rows in the microbatch, affected rows from the aux table, and affected rows
from the target table), the first step in microbatch reconciliation is
decomposing the closed historical rows that are bisected by the microbatch.
A closed historical row is a row in the target table that has a non-null
start-at and end-at. An incoming upsert/delete in the microbatch may land with
a sequence between an existing closed row's start-at and end-at (i.e. it is a
late-arriving event), bisecting that row.
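As a rough illustration of the bisection check described above, here is a
minimal plain-Python sketch. The names TargetRow, is_closed and is_bisected_by
are hypothetical and not part of Spark. Two assumptions are made that this
issue does not state: sequences are integers compared with strict inequality
on both bounds, and an event only bisects a row that shares its key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TargetRow:
    """Hypothetical stand-in for one row of the SCD2 target table."""
    key: str
    start_at: Optional[int]
    end_at: Optional[int]


def is_closed(row: TargetRow) -> bool:
    # A closed historical row has both start-at and end-at populated.
    return row.start_at is not None and row.end_at is not None


def is_bisected_by(row: TargetRow, event_key: str, event_sequence: int) -> bool:
    # Assumption: strict bounds, so an event landing exactly on start-at or
    # end-at does not count as bisecting the row.
    return (
        is_closed(row)
        and row.key == event_key
        and row.start_at < event_sequence < row.end_at
    )
```

Under these assumptions, a closed row spanning sequences 10 to 20 is bisected
by an event at sequence 15 for the same key, but not by an event at sequence
25, and an open row (null end-at) is never bisected.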
Decomposing a closed row means exactly this: splitting the closed interval
into its left and right endpoints, called the decomposed head and tail of the
original closed row respectively. The head represents some past upsert event;
the tail represents some past delete event.
Once a closed row is decomposed into its endpoints, each endpoint can either
coalesce with other endpoints/events from the full set of affected rows to form
a new historical row, or be demoted back to the aux table as a tombstone or
no-op upsert.