Re: [PR] [SPARK-56838][SDP] Introduce AutoCDC parameters dataclass [spark]

via GitHub Wed, 13 May 2026 06:39:18 -0700


szehon-ho commented on code in PR #55836:
URL: https://github.com/apache/spark/pull/55836#discussion_r3234416564



##########
sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/autocdc/ChangeArgs.scala:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.pipelines.autocdc
+
+import org.apache.spark.sql.{AnalysisException, Column}
+import org.apache.spark.sql.types.StructType
+
+sealed trait ColumnSelection
+object ColumnSelection {
+  type ColumnList = Seq[String]
+  case class IncludeColumns(columns: ColumnList) extends ColumnSelection
+  case class ExcludeColumns(columns: ColumnList) extends ColumnSelection
+
+  /**
+   * Applies [[ColumnSelection]] to a [[StructType]] and returns the filtered schema.
+   * Field names are matched exactly. Field order follows the original schema (filtered in place).
+   */
+  def applyToSchema(schema: StructType, columnSelection: Option[ColumnSelection]): StructType =
+    columnSelection match {
+      case None =>
+        // A none column selection is interpreted as a no-op.
+        schema
+      case Some(IncludeColumns(includeColumns)) =>
+        validateColumnsExistInSchema(columns = includeColumns, schema = schema)
+
+        val includeColumnSet = includeColumns.toSet
+        StructType(schema.fields.filter(f => includeColumnSet.contains(f.name)))
+      case Some(ExcludeColumns(excludeColumns)) =>
+        validateColumnsExistInSchema(columns = excludeColumns, schema = schema)
+
+        val excludeColumnSet = excludeColumns.toSet
+        StructType(schema.fields.filterNot(f => excludeColumnSet.contains(f.name)))
+    }
+
+  private def validateColumnsExistInSchema(columns: ColumnList, schema: StructType): Unit = {
+    val schemaColumns = schema.fieldNames.toSet
+    val missingColumns = columns.filterNot(schemaColumns.contains).distinct
+    if (missingColumns.nonEmpty) {
+      throw new AnalysisException(
+        errorClass = "AUTOCDC_INVALID_COLUMN_SELECTION.COLUMNS_NOT_FOUND",
+        messageParameters = Map(
+          "missingColumns" -> missingColumns.mkString(", "),
+          "availableColumns" -> schema.fieldNames.mkString(", ")
+        ))
+    }
+  }
+}
+
+/** The SCD (Slowly Changing Dimension) strategy for a CDC flow. */
+sealed trait ScdType
+
+object ScdType {
+  case object Type1 extends ScdType
+  case object Type2 extends ScdType
+}
+
+/**
+ * Configuration for an AutoCDC flow.
+ *
+ * @param keys            The column(s) that uniquely identify a row in the source data.
+ * @param sequencing      Expression ordering CDC events to correctly resolve out-of-order
+ *                        arrivals. Must be a sortable type.
+ * @param deleteCondition Expression that marks a source row as a DELETE. When None, all
+ *                        rows are treated as upserts.
+ * @param storedAsScdType The SCD strategy these args should be applied to.
+ * @param columnSelection Which source columns to include in the target table. None means
+ *                        all columns.
+ */
+case class ChangeArgs(

Review Comment:
   ChangeArgs currently places deleteCondition: Option[Column] = None (a parameter with a default) before storedAsScdType: ScdType (a parameter without a default). Scala 2.13 (Spark's Scala version) accepts this ordering, but the default is then only usable by callers that pass storedAsScdType by name; positional callers must always supply deleteCondition explicitly.
   
   Please reorder so all required fields come before any defaulted fields, for example: keys, sequencing, storedAsScdType, then deleteCondition and columnSelection with defaults, and update any call sites accordingly.
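
   A minimal sketch of the suggested ordering, for illustration only. The `Column` stand-in, the `Seq[String]` type for `keys`, and the wrapper object name are assumptions made so the snippet compiles without Spark; the quoted diff does not show the actual parameter types.

```scala
object ChangeArgsOrderingSketch {
  // Stand-in for org.apache.spark.sql.Column (assumption, keeps the sketch Spark-free).
  final case class Column(expr: String)

  sealed trait ScdType
  object ScdType {
    case object Type1 extends ScdType
    case object Type2 extends ScdType
  }

  sealed trait ColumnSelection
  final case class IncludeColumns(columns: Seq[String]) extends ColumnSelection

  // Required parameters first, defaulted parameters last.
  // The type of `keys` is assumed; the diff only shows its Scaladoc.
  final case class ChangeArgs(
      keys: Seq[String],
      sequencing: Column,
      storedAsScdType: ScdType,
      deleteCondition: Option[Column] = None,
      columnSelection: Option[ColumnSelection] = None)

  // Positional call site that relies on both trailing defaults.
  val upsertsOnly: ChangeArgs =
    ChangeArgs(Seq("id"), Column("event_ts"), ScdType.Type1)

  // Call site that overrides a single default by name.
  val withDeletes: ChangeArgs =
    ChangeArgs(Seq("id"), Column("event_ts"), ScdType.Type2,
      deleteCondition = Some(Column("op = 'DELETE'")))
}
```

   With this ordering, purely positional call sites can omit the optional fields, and named arguments are only needed to override a specific default.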



##########
sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/autocdc/ChangeArgs.scala:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.pipelines.autocdc
+
+import org.apache.spark.sql.{AnalysisException, Column}
+import org.apache.spark.sql.types.StructType
+
+sealed trait ColumnSelection
+object ColumnSelection {
+  type ColumnList = Seq[String]
+  case class IncludeColumns(columns: ColumnList) extends ColumnSelection
+  case class ExcludeColumns(columns: ColumnList) extends ColumnSelection
+
+  /**
+   * Applies [[ColumnSelection]] to a [[StructType]] and returns the filtered schema.
+   * Field names are matched exactly. Field order follows the original schema (filtered in place).
+   */
+  def applyToSchema(schema: StructType, columnSelection: Option[ColumnSelection]): StructType =
+    columnSelection match {
+      case None =>
+        // A none column selection is interpreted as a no-op.
+        schema
+      case Some(IncludeColumns(includeColumns)) =>

Review Comment:
   [Opt] Style: this is a bit lengthy. I would personally omit the key=value assignments and just pass the values.



##########
sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/autocdc/ChangeArgs.scala:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.pipelines.autocdc
+
+import org.apache.spark.sql.{AnalysisException, Column}
+import org.apache.spark.sql.types.StructType
+
+sealed trait ColumnSelection
+object ColumnSelection {
+  type ColumnList = Seq[String]

Review Comment:
   To check: we do not handle nested columns?
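
   To make the question concrete, here is a small sketch of what exact matching on top-level names implies. It uses simplified stand-ins (`Field`, `StructT`, `StringT`) rather than Spark's `StructType`, and the helper names are hypothetical; it only mirrors the filter shown in the diff, it is not the PR's code.

```scala
object NestedSelectionSketch {
  // Simplified stand-ins for Spark's StructField / StructType (assumption).
  sealed trait FieldType
  case object StringT extends FieldType
  final case class StructT(fields: Seq[Field]) extends FieldType
  final case class Field(name: String, dataType: FieldType)

  // Mirrors the filter in the diff: exact match against top-level names only.
  def includeTopLevel(schema: Seq[Field], columns: Seq[String]): Seq[Field] = {
    val wanted = columns.toSet
    schema.filter(f => wanted.contains(f.name))
  }

  // Requested names that match no top-level field, in input order.
  def missing(schema: Seq[Field], columns: Seq[String]): Seq[String] = {
    val names = schema.map(_.name).toSet
    columns.filterNot(names.contains).distinct
  }

  val schema: Seq[Field] = Seq(
    Field("id", StringT),
    Field("address", StructT(Seq(Field("city", StringT), Field("zip", StringT)))))
}
```

   Under these assumptions a dotted path such as "address.city" is reported as missing, while "address" selects the whole struct.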



##########
sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/autocdc/ChangeArgs.scala:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.pipelines.autocdc
+
+import org.apache.spark.sql.{AnalysisException, Column}
+import org.apache.spark.sql.types.StructType
+
+sealed trait ColumnSelection
+object ColumnSelection {
+  type ColumnList = Seq[String]
+  case class IncludeColumns(columns: ColumnList) extends ColumnSelection
+  case class ExcludeColumns(columns: ColumnList) extends ColumnSelection
+
+  /**
+   * Applies [[ColumnSelection]] to a [[StructType]] and returns the filtered schema.
+   * Field names are matched exactly. Field order follows the original schema (filtered in place).
+   */
+  def applyToSchema(schema: StructType, columnSelection: Option[ColumnSelection]): StructType =
+    columnSelection match {
+      case None =>
+        // A none column selection is interpreted as a no-op.
+        schema
+      case Some(IncludeColumns(includeColumns)) =>
+        validateColumnsExistInSchema(columns = includeColumns, schema = schema)

Review Comment:
   Also checking: if columns is empty, this creates an empty StructType. Is that expected?
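
   A small sketch of the empty-list case, using a `Field` stand-in instead of Spark's types and leaving out the existence check (an empty list trivially passes it). `requireNonEmptyInclude` is a hypothetical guard, shown only as one possible answer if an empty result is not intended.

```scala
object EmptySelectionSketch {
  // Stand-in for Spark's StructField (assumption).
  final case class Field(name: String)

  sealed trait ColumnSelection
  final case class IncludeColumns(columns: Seq[String]) extends ColumnSelection
  final case class ExcludeColumns(columns: Seq[String]) extends ColumnSelection

  // Same filtering shape as applyToSchema in the diff, minus validation.
  def applyToSchema(schema: Seq[Field], selection: Option[ColumnSelection]): Seq[Field] =
    selection match {
      case None => schema
      case Some(IncludeColumns(cols)) =>
        val set = cols.toSet
        schema.filter(f => set.contains(f.name))
      case Some(ExcludeColumns(cols)) =>
        val set = cols.toSet
        schema.filterNot(f => set.contains(f.name))
    }

  // Hypothetical guard, not part of the PR: reject an empty include list up front.
  def requireNonEmptyInclude(
      selection: Option[ColumnSelection]): Either[String, Option[ColumnSelection]] =
    selection match {
      case Some(IncludeColumns(cols)) if cols.isEmpty =>
        Left("IncludeColumns must name at least one column")
      case other => Right(other)
    }

  val schema: Seq[Field] = Seq(Field("id"), Field("name"))
}
```

   In this sketch an empty include list yields an empty schema, while an empty exclude list leaves the schema unchanged.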



##########
sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/autocdc/ChangeArgs.scala:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.pipelines.autocdc
+
+import org.apache.spark.sql.{AnalysisException, Column}
+import org.apache.spark.sql.types.StructType
+
+sealed trait ColumnSelection
+object ColumnSelection {
+  type ColumnList = Seq[String]
+  case class IncludeColumns(columns: ColumnList) extends ColumnSelection
+  case class ExcludeColumns(columns: ColumnList) extends ColumnSelection
+
+  /**
+   * Applies [[ColumnSelection]] to a [[StructType]] and returns the filtered schema.
+   * Field names are matched exactly. Field order follows the original schema (filtered in place).
+   */
+  def applyToSchema(schema: StructType, columnSelection: Option[ColumnSelection]): StructType =
+    columnSelection match {
+      case None =>
+        // A none column selection is interpreted as a no-op.
+        schema
+      case Some(IncludeColumns(includeColumns)) =>
+        validateColumnsExistInSchema(columns = includeColumns, schema = schema)
+
+        val includeColumnSet = includeColumns.toSet
+        StructType(schema.fields.filter(f => includeColumnSet.contains(f.name)))
+      case Some(ExcludeColumns(excludeColumns)) =>
+        validateColumnsExistInSchema(columns = excludeColumns, schema = schema)
+
+        val excludeColumnSet = excludeColumns.toSet
+        StructType(schema.fields.filterNot(f => excludeColumnSet.contains(f.name)))
+    }
+
+  private def validateColumnsExistInSchema(columns: ColumnList, schema: StructType): Unit = {
+    val schemaColumns = schema.fieldNames.toSet
+    val missingColumns = columns.filterNot(schemaColumns.contains).distinct
+    if (missingColumns.nonEmpty) {
+      throw new AnalysisException(
+        errorClass = "AUTOCDC_INVALID_COLUMN_SELECTION.COLUMNS_NOT_FOUND",
+        messageParameters = Map(
+          "missingColumns" -> missingColumns.mkString(", "),
+          "availableColumns" -> schema.fieldNames.mkString(", ")
+        ))
+    }
+  }
+}
+
+/** The SCD (Slowly Changing Dimension) strategy for a CDC flow. */
+sealed trait ScdType
+
+object ScdType {
+  case object Type1 extends ScdType

Review Comment:
   Should we add Scaladoc for Type1 and Type2?
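
   Something along these lines, as a sketch. The wording follows the general data-warehousing meaning of SCD Type 1 and Type 2; how AutoCDC actually materializes each strategy is an assumption here, and `retainsHistory` is a hypothetical helper added only to make the distinction checkable.

```scala
object ScdTypeDocSketch {
  /** The SCD (Slowly Changing Dimension) strategy for a CDC flow. */
  sealed trait ScdType

  object ScdType {
    /**
     * SCD Type 1: the target keeps only the latest version of each key.
     * Updates overwrite the existing row and no history is retained.
     */
    case object Type1 extends ScdType

    /**
     * SCD Type 2: every change to a key is kept as a separate row version,
     * so the full history of each key is preserved.
     */
    case object Type2 extends ScdType
  }

  // Hypothetical helper, not in the PR: states the documented difference in code.
  def retainsHistory(scdType: ScdType): Boolean = scdType match {
    case ScdType.Type1 => false
    case ScdType.Type2 => true
  }
}
```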



##########
sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/autocdc/ChangeArgs.scala:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.pipelines.autocdc
+
+import org.apache.spark.sql.{AnalysisException, Column}
+import org.apache.spark.sql.types.StructType
+
+sealed trait ColumnSelection
+object ColumnSelection {
+  type ColumnList = Seq[String]
+  case class IncludeColumns(columns: ColumnList) extends ColumnSelection
+  case class ExcludeColumns(columns: ColumnList) extends ColumnSelection
+
+  /**
+   * Applies [[ColumnSelection]] to a [[StructType]] and returns the filtered schema.
+   * Field names are matched exactly. Field order follows the original schema (filtered in place).
+   */
+  def applyToSchema(schema: StructType, columnSelection: Option[ColumnSelection]): StructType =
+    columnSelection match {
+      case None =>
+        // A none column selection is interpreted as a no-op.
+        schema
+      case Some(IncludeColumns(includeColumns)) =>
+        validateColumnsExistInSchema(columns = includeColumns, schema = schema)

Review Comment:
   And would it be better to pass in the includeColumnSet, to make the validation easier?
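
   Roughly what I have in mind, as a sketch with a `Field` stand-in and an `Either` result in place of `AnalysisException` (both assumptions so the snippet runs without Spark). One trade-off: a `Set` does not preserve the caller's ordering, so the missing names are sorted here to keep the error output deterministic.

```scala
object ValidateWithSetSketch {
  // Stand-in for Spark's StructField (assumption).
  final case class Field(name: String)

  // Missing names are computed as a set difference and sorted for stable output.
  def missingColumns(columns: Set[String], schema: Seq[Field]): Seq[String] = {
    val schemaColumns = schema.map(_.name).toSet
    columns.diff(schemaColumns).toSeq.sorted
  }

  // The set is built once and shared by validation and filtering.
  def include(schema: Seq[Field], includeColumns: Seq[String]): Either[Seq[String], Seq[Field]] = {
    val includeColumnSet = includeColumns.toSet
    val missing = missingColumns(includeColumnSet, schema)
    if (missing.nonEmpty) Left(missing)
    else Right(schema.filter(f => includeColumnSet.contains(f.name)))
  }

  val schema: Seq[Field] = Seq(Field("id"), Field("name"), Field("ts"))
}
```

   The result still follows schema order, and duplicates in the input list collapse for free when the set is built.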



##########
sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/autocdc/ChangeArgs.scala:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.pipelines.autocdc
+
+import org.apache.spark.sql.{AnalysisException, Column}
+import org.apache.spark.sql.types.StructType
+
+sealed trait ColumnSelection
+object ColumnSelection {
+  type ColumnList = Seq[String]
+  case class IncludeColumns(columns: ColumnList) extends ColumnSelection
+  case class ExcludeColumns(columns: ColumnList) extends ColumnSelection
+
+  /**
+   * Applies [[ColumnSelection]] to a [[StructType]] and returns the filtered schema.
+   * Field names are matched exactly. Field order follows the original schema (filtered in place).
+   */
+  def applyToSchema(schema: StructType, columnSelection: Option[ColumnSelection]): StructType =
+    columnSelection match {
+      case None =>
+        // A none column selection is interpreted as a no-op.
+        schema
+      case Some(IncludeColumns(includeColumns)) =>
+        validateColumnsExistInSchema(columns = includeColumns, schema = schema)
+
+        val includeColumnSet = includeColumns.toSet
+        StructType(schema.fields.filter(f => includeColumnSet.contains(f.name)))
+      case Some(ExcludeColumns(excludeColumns)) =>
+        validateColumnsExistInSchema(columns = excludeColumns, schema = schema)
+
+        val excludeColumnSet = excludeColumns.toSet
+        StructType(schema.fields.filterNot(f => excludeColumnSet.contains(f.name)))
+    }
+
+  private def validateColumnsExistInSchema(columns: ColumnList, schema: StructType): Unit = {
+    val schemaColumns = schema.fieldNames.toSet
+    val missingColumns = columns.filterNot(schemaColumns.contains).distinct
+    if (missingColumns.nonEmpty) {
+      throw new AnalysisException(
+        errorClass = "AUTOCDC_INVALID_COLUMN_SELECTION.COLUMNS_NOT_FOUND",
+        messageParameters = Map(
+          "missingColumns" -> missingColumns.mkString(", "),
+          "availableColumns" -> schema.fieldNames.mkString(", ")
+        ))
+    }
+  }
+}
+
+/** The SCD (Slowly Changing Dimension) strategy for a CDC flow. */
+sealed trait ScdType
+
+object ScdType {
+  case object Type1 extends ScdType
+  case object Type2 extends ScdType
+}
+
+/**
+ * Configuration for an AutoCDC flow.
+ *
+ * @param keys            The column(s) that uniquely identify a row in the source data.
+ * @param sequencing      Expression ordering CDC events to correctly resolve out-of-order
+ *                        arrivals. Must be a sortable type.
+ * @param deleteCondition Expression that marks a source row as a DELETE. When None, all
+ *                        rows are treated as upserts.
+ * @param storedAsScdType The SCD strategy these args should be applied to.
+ * @param columnSelection Which source columns to include in the target table. None means

Review Comment:
   How about 'to include' => 'to select'? It covers both include and exclude, from what I see.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

