ericm-db commented on code in PR #54098:
URL: https://github.com/apache/spark/pull/54098#discussion_r2771673777
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -633,6 +633,86 @@ case class Union(
copy(children = newChildren)
}
+/**
+ * Logical plan for unioning multiple plans sequentially, processing each child to completion
+ * before moving to the next. This is used for backfill-to-live streaming scenarios where
+ * historical data should be processed completely before switching to live data.
+ *
+ * Unlike [[Union]] which processes all children concurrently in streaming queries, SequentialUnion
+ * processes each child source sequentially:
+ * 1. First child processes until complete (bounded sources reach their end)
+ * 2. Second child begins processing
+ * 3. And so on...
+ *
+ * Requirements:
+ * - Minimum 2 children required
+ * - All children must be streaming sources
+ * - All non-final children must support bounded execution (SupportsTriggerAvailableNow)
+ * - All children must have explicit names when used in streaming queries
+ * - Children cannot contain stateful operations (aggregations, joins, etc.)
+ * - Schema compatibility is enforced via UnionBase
+ *
+ * State preservation: Stateful operators applied AFTER SequentialUnion (aggregations,
+ * watermarks, deduplication, joins) preserve their state across source transitions,
+ * enabling seamless backfill-to-live scenarios.
+ *
+ * Example:
+ * {{{
+ * val historical = spark.readStream.format("delta").name("historical").load("/data")
+ * val live = spark.readStream.format("kafka").name("live").load()
+ * // Correct: stateful operations after SequentialUnion
+ * historical.followedBy(live).groupBy("key").count()
+ *
+ * // Incorrect: stateful operations before SequentialUnion
+ * // historical.groupBy("key").count().followedBy(live.groupBy("key").count()) // Not allowed
+ * }}}
+ *
+ * @param children The logical plans to union sequentially (must be streaming sources)
+ * @param byName Whether to resolve columns by name
+ * @param allowMissingCol Whether to allow missing columns in children
Review Comment:
added some explanation
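
For readers skimming the Scaladoc above, the "process each child to completion before moving to the next" ordering can be pictured with a toy model. This is only a sketch using plain Scala iterators in place of streaming sources. It is not the PR's implementation and uses no Spark API; `SequentialUnionSketch`, `Source`, and `sequential` are hypothetical names introduced for illustration.

```scala
// Toy model only: plain Scala iterators stand in for streaming sources.
// Nothing here is Spark API; every name below is hypothetical.
object SequentialUnionSketch {
  // A named "source" that yields a finite or unbounded run of records.
  final case class Source[A](name: String, records: Iterator[A])

  // Drain each source to completion before touching the next one.
  // Iterator.flatMap is lazy, so a later source is not read until
  // every earlier source is exhausted.
  def sequential[A](sources: Seq[Source[A]]): Iterator[(String, A)] = {
    require(sources.size >= 2, "at least two children are required")
    sources.iterator.flatMap(s => s.records.map(r => (s.name, r)))
  }

  def main(args: Array[String]): Unit = {
    // Bounded "backfill" source: it must end so the next one can start.
    val historical = Source("historical", Iterator(1, 2, 3))
    // Unbounded final source: acceptable because nothing follows it.
    val live = Source("live", Iterator.from(100))

    // Take a finite prefix, since the combined stream never ends.
    val out = sequential(Seq(historical, live)).take(5).toList
    println(out)
  }
}
```

The model also suggests why only the non-final children need to be bounded: an unbounded source in any earlier position would never hand over to the next one.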