zedtang commented on code in PR #47301:
URL: https://github.com/apache/spark/pull/47301#discussion_r1686736484
##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:
##########
@@ -201,6 +201,22 @@ final class DataFrameWriter[T] private[sql] (ds: Dataset[T]) {
     this
   }

+  /**
+   * Clusters the output by the given columns on the file system. The rows with matching values in

Review Comment:
   sure, updated here and below

##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala:
##########
@@ -209,10 +209,25 @@ object ClusterBySpec {
     normalizeClusterBySpec(schema, clusterBySpec, resolver).toJson
   }

+  /**
+   * Converts a ClusterBySpec to a map of table properties used to store the clustering
+   * information in the table catalog.
+   *
+   * @param clusterBySpec : existing ClusterBySpec to be converted to properties.
+   */
+  def toProperties(clusterBySpec: ClusterBySpec): Map[String, String] = {

Review Comment:
   Besides the different return type, `toProperty` additionally validates the clustering columns against the table schema, so it takes two more input parameters (`schema` and `resolver`). I updated the comments.

##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala:
##########
@@ -104,9 +106,27 @@ final class DataFrameWriterV2[T] private[sql](table: String, ds: Dataset[T])
     }

     this.partitioning = Some(asTransforms)
+    validatePartitioning()
+    this
+  }
+
+  @scala.annotation.varargs
+  override def clusterBy(colName: String, colNames: String*): CreateTableWriter[T] = {
+    this.clustering =
+      Some(ClusterByTransform((colName +: colNames).map(col => FieldReference(col))))
+    validatePartitioning()
     this
   }

+  /**
+   * Validate that clusterBy is not used with partitionBy.
+   */
+  private def validatePartitioning(): Unit = {
+    if (partitioning.nonEmpty && clustering.nonEmpty) {

Review Comment:
   For DFW V2, [bucketBy](https://github.com/apache/spark/blob/3af2ea973c458b9bd9818d6af733683fb15fbc19/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L98) is implemented as a special partitionedBy transform, so I didn't call it out here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
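The `toProperty` vs. `toProperties` distinction discussed above can be sketched in plain Scala. This is a Spark-free illustration with hypothetical names (`SimpleClusterBySpec`, `clusteringColumnsKey`), not Spark's actual implementation: one path plainly serializes the clustering columns into a table property, the other first validates the columns against a schema via a resolver function.

```scala
// Hypothetical, Spark-free sketch of the two conversion paths discussed above.
// Names here are illustrative; the real logic lives in ClusterBySpec.
final case class SimpleClusterBySpec(columnNames: Seq[String])

object SimpleClusterBySpec {
  private val clusteringColumnsKey = "clusteringColumns"

  // toProperties-style: plain conversion to catalog table properties,
  // no validation of the columns.
  def toProperties(spec: SimpleClusterBySpec): Map[String, String] =
    Map(clusteringColumnsKey -> spec.columnNames.mkString(","))

  // toProperty-style: same conversion, but first validates the clustering
  // columns against a schema using a resolver (here just a name comparison),
  // which is why it needs the two extra parameters.
  def toProperty(
      schema: Seq[String],
      spec: SimpleClusterBySpec,
      resolver: (String, String) => Boolean): (String, String) = {
    spec.columnNames.foreach { col =>
      require(schema.exists(resolver(_, col)), s"Unknown clustering column: $col")
    }
    clusteringColumnsKey -> spec.columnNames.mkString(",")
  }
}
```

The extra `schema` and `resolver` parameters are exactly what makes the validating variant heavier, which matches the reviewer's explanation for keeping the two methods separate.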
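The mutual-exclusion check that `validatePartitioning` enforces can be modeled with a small Spark-free sketch. The class and error message below are illustrative only; in the real `DataFrameWriterV2`, `bucketBy` also feeds into `partitioning` (as a bucket transform), so the single `partitioning`-vs-`clustering` check covers it too, as the reviewer notes.

```scala
// Illustrative, Spark-free model of the builder-style mutual-exclusion check:
// partitioning and clustering may not both be set on the same writer.
final class ClusteredWriterSketch {
  private var partitioning: Option[Seq[String]] = None
  private var clustering: Option[Seq[String]] = None

  def partitionedBy(cols: String*): this.type = {
    partitioning = Some(cols)
    validatePartitioning()
    this
  }

  def clusterBy(col: String, cols: String*): this.type = {
    clustering = Some(col +: cols)
    validatePartitioning()
    this
  }

  // Fails as soon as both specs are present, mirroring the check in the diff.
  private def validatePartitioning(): Unit = {
    if (partitioning.nonEmpty && clustering.nonEmpty) {
      throw new IllegalArgumentException(
        "Clustering and partitioning cannot both be specified.")
    }
  }
}
```

Calling the check from every setter (rather than once at build time) makes the error surface at the exact call site that introduced the conflict, which is the pattern the diff follows.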