Re: [PR] [SPARK-55317][SQL] Add SequentialUnion logical plan node and planning rule [spark]

via GitHub Thu, 05 Feb 2026 16:08:27 -0800


ericm-db commented on code in PR #54098:
URL: https://github.com/apache/spark/pull/54098#discussion_r2771673777



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -633,6 +633,86 @@ case class Union(
     copy(children = newChildren)
 }
 
+/**
+ * Logical plan for unioning multiple plans sequentially, processing each child to completion
+ * before moving to the next. This is used for backfill-to-live streaming scenarios where
+ * historical data should be processed completely before switching to live data.
+ *
+ * Unlike [[Union]] which processes all children concurrently in streaming queries, SequentialUnion
+ * processes each child source sequentially:
+ * 1. First child processes until complete (bounded sources reach their end)
+ * 2. Second child begins processing
+ * 3. And so on...
+ *
+ * Requirements:
+ * - Minimum 2 children required
+ * - All children must be streaming sources
+ * - All non-final children must support bounded execution (SupportsTriggerAvailableNow)
+ * - All children must have explicit names when used in streaming queries
+ * - Children cannot contain stateful operations (aggregations, joins, etc.)
+ * - Schema compatibility is enforced via UnionBase
+ *
+ * State preservation: Stateful operators applied AFTER SequentialUnion (aggregations,
+ * watermarks, deduplication, joins) preserve their state across source transitions,
+ * enabling seamless backfill-to-live scenarios.
+ *
+ * Example:
+ * {{{
+ *   val historical = spark.readStream.format("delta").name("historical").load("/data")
+ *   val live = spark.readStream.format("kafka").name("live").load()
+ *   // Correct: stateful operations after SequentialUnion
+ *   historical.followedBy(live).groupBy("key").count()
+ *
+ *   // Incorrect: stateful operations before SequentialUnion
+ *   // historical.groupBy("key").count().followedBy(live.groupBy("key").count()) // Not allowed
+ * }}}
+ *
+ * @param children        The logical plans to union sequentially (must be streaming sources)
+ * @param byName          Whether to resolve columns by name
+ * @param allowMissingCol Whether to allow missing columns in children

Review Comment:
   added some explanation
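
   For readers of this thread, the requirements listed in the doc comment above can be sketched as a standalone check. This is only an illustration and not the PR's implementation: `ChildInfo`, `SequentialUnionCheck`, and `violations` are hypothetical names, the flags stand in for properties that Spark would derive from the actual plans, and the schema-compatibility requirement is left out.

```scala
// Hypothetical model, not Spark's API: each child is summarized by a few flags.
final case class ChildInfo(
    name: Option[String],          // explicit source name, if any
    isStreaming: Boolean,          // true for a streaming source
    supportsAvailableNow: Boolean, // stands in for SupportsTriggerAvailableNow
    hasStatefulOp: Boolean)        // aggregation, join, etc. inside the child

object SequentialUnionCheck {
  /** Returns the violated requirements; an empty result means the children are acceptable. */
  def violations(children: Seq[ChildInfo]): Seq[String] = {
    val errors = Seq.newBuilder[String]
    if (children.size < 2) errors += "at least 2 children are required"
    children.zipWithIndex.foreach { case (c, i) =>
      if (!c.isStreaming) errors += s"child $i is not a streaming source"
      if (c.name.isEmpty) errors += s"child $i has no explicit name"
      if (c.hasStatefulOp) errors += s"child $i contains a stateful operation"
    }
    // Only non-final children must be bounded; the last child may run indefinitely.
    children.dropRight(1).zipWithIndex.foreach { case (c, i) =>
      if (!c.supportsAvailableNow) {
        errors += s"non-final child $i does not support bounded execution"
      }
    }
    errors.result()
  }
}
```

   Under these assumptions, a bounded "historical" child followed by an unbounded "live" child passes, while swapping the two reports the non-final child as unbounded.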



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


