Re: [PR] [SPARK-56686][SQL] Support streaming row-level CDC post-processing [spark]

via GitHub Thu, 30 Apr 2026 19:13:14 -0700


gengliangwang commented on code in PR #55636:
URL: https://github.com/apache/spark/pull/55636#discussion_r3171765294



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveChangelogTable.scala:
##########
@@ -197,6 +216,188 @@ object ResolveChangelogTable extends Rule[LogicalPlan] {
     removeHelperColumns(modifiedPlan)
   }
 
+  /**
+   * Streaming counterpart of [[addRowLevelPostProcessing]].
+   *
+   * ==Why a different shape from the batch path?==
+   *
+   * The batch rewrite is Window-based:
+   * {{{
+   *   DataSourceV2Relation
+   *     -> Window partitioned by (rowId..., _commit_version)
+   *     -> [Filter (carry-over)]
+   *     -> [Project (update relabel)]
+   *     -> Project (drop helper columns)
+   * }}}
+   * [[org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker]] rejects
+   * `Window` on streaming queries (`NON_TIME_WINDOW_NOT_SUPPORTED_IN_STREAMING`).
+   * Replacing it with a plain [[Aggregate]] is not enough on its own: an aggregate
+   * collapses each group to a single row, losing the per-input rows we still need to
+   * relabel/filter; and an append-mode streaming aggregate without an event-time
+   * watermark on a grouping key is itself rejected by the checker.
+   *
+   * ==The rewritten plan==
+   *
+   * Two adjustments over the naive substitution: (a) inject an [[EventTimeWatermark]]
+   * on `_commit_timestamp` (zero delay) so the aggregate is legal in append mode, and
+   * (b) buffer every input row of a group as `Inline`-able structs and re-explode after
+   * the aggregate so no rows are lost.
+   * {{{
+   *   DataSourceV2Relation
+   *     -> EventTimeWatermark(_commit_timestamp, 0s)
+   *     -> Aggregate
+   *          group by (rowId..., _commit_version, _commit_timestamp)
+   *          aggs    : _del_cnt, _ins_cnt
+   *                    [, _min_rv, _max_rv, _rv_cnt  (carry-over removal only)]
+   *                    , __spark_cdc_events = collect_list(struct(*))
+   *     -> [Filter (carry-over: _del_cnt=1 AND _ins_cnt=1
+   *                             AND _rv_cnt=2 AND _min_rv=_max_rv)]
+   *     -> Generate(Inline(__spark_cdc_events))   // re-emit one row per buffered input
+   *     -> [Project (update relabel)]
+   *     -> Project (drop helper columns)
+   * }}}
+   *
+   * ==Runtime walkthrough==
+   *
+   * Append-mode streaming aggregates emit a group when its event-time grouping key
+   * falls at or below the global watermark (eviction predicate `eventTime <= watermark`,
+   * applied at the start of the next micro-batch). Suppose three commits with
+   * `_commit_timestamp` 10, 20, 30 each arrive in their own micro-batch:
+   * {{{
+   *   batch  max _ts seen  watermark after batch  groups emitted by this batch
+   *   -----  ------------  ---------------------  ----------------------------
+   *     1         10                10            <none>
+   *     2         20                20            groups with _commit_timestamp == 10
+   *     3         30                30            groups with _commit_timestamp == 20
+   *   end-of-stream final flush                   groups with _commit_timestamp == 30
+   * }}}
+   * Because every row of a single commit shares the same `_commit_timestamp` (CDC
+   * contract), advancing past commit T releases every group whose grouping
+   * `_commit_timestamp` equals T -- one commit's worth of post-processed output per
+   * micro-batch, with the final commit flushed on stream termination.
+   *
+   * ==Per-operator detail==
+   *
+   *  1. [[EventTimeWatermark]] on `_commit_timestamp` (zero delay) -- required so the
+   *     downstream stateful aggregate can emit groups in append output mode. By CDC
+   *     contract every row in a single commit shares `_commit_timestamp`, so taking it
+   *     as event time is safe. Note: this is currently the only analyzer rule that
+   *     auto-injects an [[EventTimeWatermark]] (others resolve user-supplied watermarks).
+   *     The watermark metadata is preserved on the user-visible `_commit_timestamp`
+   *     output (since [[Generate]]'s `generatorOutput` copies attribute metadata), so a
+   *     downstream user-supplied `withWatermark` on a different column will interact
+   *     with this internal watermark under the global multi-watermark policy.
+   *  2. [[Aggregate]] keyed by `(rowId..., _commit_version, _commit_timestamp)`. Computes
+   *     the same `_del_cnt` / `_ins_cnt` / (`_min_rv` / `_max_rv` / `_rv_cnt`) helpers as
+   *     the batch path, plus an `__spark_cdc_events` array-of-struct buffering every
+   *     input row of the group. `_commit_timestamp` is included in the grouping keys
+   *     (besides being a no-op given the contract) to satisfy
+   *     [[org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker]]'s
+   *     requirement that the watermark attribute appear among grouping expressions for
+   *     append-mode streaming aggregations.
+   *  3. [[Filter]] (only when carry-over removal is requested) on the same predicate as
+   *     the batch path -- groups with `_del_cnt = 1 AND _ins_cnt = 1 AND _rv_cnt = 2 AND
+   *     _min_rv = _max_rv` are dropped wholesale.
+   *  4. [[Generate]] using `Inline(events)` to re-emit one output row per buffered input
+   *     row. `unrequiredChildIndex` drops the duplicate grouping columns and the events
+   *     buffer; the helper count columns flow through.
+   *  5. [[Project]] (only when update detection is requested) applying the same
+   *     `CHANGELOG_CONTRACT_VIOLATION.UNEXPECTED_MULTIPLE_CHANGES_PER_ROW_VERSION`
+   *     guard and `_change_type` relabel as the batch path.
+   *  6. Final [[Project]] (via [[removeHelperColumns]]) drops `__spark_cdc_*` helpers so
+   *     the output schema matches the connector's declared schema.
+   */
+  private def addStreamingRowLevelPostProcessing(

Review Comment:
   Good catch. Implemented in dee5e84: `UnsupportedOperationChecker.checkForStreaming` 
now has a CDC-specific case that detects the rewrite by the `__spark_cdc_events` 
helper aggregate expression and throws 
`STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION` for non-Append output modes. Negative 
end-to-end tests for both Update and Complete output modes were added to 
`ChangelogEndToEndSuite`.
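
   For readers following the thread, the runtime walkthrough in the Scaladoc above can be reproduced with a small plain-Scala simulation. This is only a sketch under stated assumptions, not Spark code: `WatermarkWalkthrough`, `BatchResult` and `simulate` are hypothetical names, each commit is assumed to arrive in its own micro-batch, and the zero-delay watermark is modeled as the maximum `_commit_timestamp` seen so far.

```scala
object WatermarkWalkthrough {
  /** Outcome of one micro-batch: the watermark after it and the commit timestamps it released. */
  final case class BatchResult(batch: Int, watermarkAfter: Long, emitted: Seq[Long])

  /**
   * Models append-mode eviction with a zero-delay watermark (an assumption-laden
   * simplification of the real state store). Returns the per-batch results plus
   * the timestamps that are only released by the end-of-stream flush.
   */
  def simulate(commitTimestamps: Seq[Long]): (Seq[BatchResult], Seq[Long]) = {
    // State: (current watermark, buffered commit timestamps, results so far).
    val init = (Long.MinValue, Vector.empty[Long], Vector.empty[BatchResult])
    val (_, buffered, results) = commitTimestamps.zipWithIndex.foldLeft(init) {
      case ((watermark, held, acc), (ts, i)) =>
        // Eviction runs at the start of the batch, against the previous watermark:
        // a group is released when eventTime <= watermark.
        val (emitted, kept) = held.partition(_ <= watermark)
        // Zero delay: the watermark advances to the max event time seen so far.
        val newWatermark = math.max(watermark, ts)
        (newWatermark, kept :+ ts, acc :+ BatchResult(i + 1, newWatermark, emitted))
    }
    (results, buffered)
  }

  def main(args: Array[String]): Unit = {
    val (batches, finalFlush) = simulate(Seq(10L, 20L, 30L))
    batches.foreach(println)
    println(s"end-of-stream final flush: $finalFlush")
  }
}
```

   With commits at 10, 20 and 30 the simulation releases nothing in batch 1, timestamp 10 in batch 2, timestamp 20 in batch 3, and leaves 30 for the final flush, which matches the table in the Scaladoc.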



##########
sql/core/src/test/scala/org/apache/spark/sql/connector/ChangelogEndToEndSuite.scala:
##########
@@ -662,4 +663,143 @@ class ChangelogEndToEndSuite extends SharedSparkSession {
     }
     assert(e.getMessage.contains("changes"))
   }
+
+  // ---------- Streaming: row-level post-processing ----------
+  //
+  // Streaming row-level passes (carry-over removal, update detection) rewrite the plan
+  // into Aggregate(rowId, _commit_version, _commit_timestamp) -> [Filter] ->
+  // Generate(Inline(events)) -> [relabel Project], under an EventTimeWatermark on
+  // _commit_timestamp.
+
+  /** Schema variant for post-processing tests: includes `row_commit_version`. */
+  private def recreateWithRowVersion(): Identifier = {
+    val id = ident
+    val cat = catalog
+    if (cat.tableExists(id)) cat.dropTable(id)
+    cat.createTable(
+      id,
+      Array(
+        Column.create("id", LongType, false),
+        Column.create("data", StringType),
+        Column.create("row_commit_version", LongType, false)),
+      Array.empty,
+      new util.HashMap[String, String]())
+    cat.clearChangeRows(id)
+    id
+  }
+
+  /** Row constructor for the row-version-enabled schema. */
+  private def ppRow(
+      id: Long,
+      data: String,
+      rcv: Long,
+      changeType: String,
+      commitVersion: Long,
+      commitTimestampMicros: Long): InternalRow = {
+    InternalRow(
+      id,
+      UTF8String.fromString(data),
+      rcv,
+      UTF8String.fromString(changeType),
+      commitVersion,
+      commitTimestampMicros)
+  }
+
+  test("streaming carry-over removal drops CoW pairs") {
+    val id = recreateWithRowVersion()
+    catalog.setChangelogProperties(id, ChangelogProperties(
+      containsCarryoverRows = true,
+      rowIdNames = Seq("id"),
+      rowVersionName = Some("row_commit_version")))
+
+    catalog.addChangeRows(id, Seq(
+      // v1: insert Alice (rcv=1), Bob (rcv=1)
+      ppRow(1L, "Alice", 1L, CHANGE_TYPE_INSERT, 1L, 1000000L),
+      ppRow(2L, "Bob",   1L, CHANGE_TYPE_INSERT, 1L, 1000000L),
+      // v2: real delete Alice + carry-over for Bob (rcv unchanged)
+      ppRow(1L, "Alice", 1L, CHANGE_TYPE_DELETE, 2L, 2000000L),
+      ppRow(2L, "Bob",   1L, CHANGE_TYPE_DELETE, 2L, 2000000L),
+      ppRow(2L, "Bob",   1L, CHANGE_TYPE_INSERT, 2L, 2000000L)))
+
+    val q = spark.readStream
+      .option("startingVersion", "1")
+      .changes(fullTableName)
+      .select("id", "data", "_change_type", "_commit_version")
+      .writeStream
+      .format("memory")
+      .queryName("cdc_stream_carryover")
+      .outputMode("append")

Review Comment:
   Done in dee5e84 — instead of behavioral tests, the rewrite now explicitly 
rejects Update / Complete output modes (see the reply on 
`addStreamingRowLevelPostProcessing` above), so the new tests are negative ones 
asserting `STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION` is raised at 
writer-start time. The error message names "Change Data Capture (CDC) streaming 
reads with post-processing" so the failure mode is discoverable.
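
   As a reading aid for the carry-over test quoted above, the aggregate, filter and re-explode steps can be sketched over plain Scala collections. Everything here is illustrative rather than the PR's implementation: `CarryOverSketch`, `Change` and `removeCarryOvers` are hypothetical names, the change-type strings `"insert"` and `"delete"` are assumed values for `CHANGE_TYPE_INSERT` and `CHANGE_TYPE_DELETE`, and `_rv_cnt` is assumed to count the row-version values in a group.

```scala
object CarryOverSketch {
  /** One changelog row; field names are illustrative, not the connector's schema. */
  final case class Change(
      id: Long,
      data: String,
      rowVersion: Long,
      changeType: String, // assumed values: "insert" / "delete"
      commitVersion: Long,
      commitTimestamp: Long)

  /** Drops copy-on-write carry-over pairs, keeping every row of all other groups. */
  def removeCarryOvers(changes: Seq[Change]): Seq[Change] =
    changes
      // Aggregate step: group by (rowId, _commit_version, _commit_timestamp),
      // keeping all rows of the group (the role of the events buffer).
      .groupBy(c => (c.id, c.commitVersion, c.commitTimestamp))
      .values
      // Filter step: drop a group that is exactly one delete plus one insert
      // with an unchanged row version.
      .filterNot { group =>
        val delCnt = group.count(_.changeType == "delete")
        val insCnt = group.count(_.changeType == "insert")
        val rowVersions = group.map(_.rowVersion)
        delCnt == 1 && insCnt == 1 &&
          rowVersions.size == 2 && rowVersions.min == rowVersions.max
      }
      // Re-explode step: emit one output row per buffered input row.
      .flatten
      .toSeq
      // Sorted only to make the result deterministic for inspection.
      .sortBy(c => (c.commitVersion, c.id))

  def main(args: Array[String]): Unit = {
    val input = Seq(
      Change(1L, "Alice", 1L, "insert", 1L, 1000000L),
      Change(2L, "Bob", 1L, "insert", 1L, 1000000L),
      Change(1L, "Alice", 1L, "delete", 2L, 2000000L),
      Change(2L, "Bob", 1L, "delete", 2L, 2000000L),
      Change(2L, "Bob", 1L, "insert", 2L, 2000000L))
    removeCarryOvers(input).foreach(println)
  }
}
```

   On the test's input this keeps the two v1 inserts and the real delete of Alice in v2, and drops the delete/insert pair for Bob in v2 because his row version is unchanged.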



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

