This is an automated email from the ASF dual-hosted git repository.
hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 6fd77b1e7f5b [SPARK-50264][SQL][CONNECT] Add missing methods to DataStreamWriter
6fd77b1e7f5b is described below
commit 6fd77b1e7f5ba45e13069b747d206043d25e62e0
Author: Herman van Hovell <[email protected]>
AuthorDate: Sun Nov 10 18:00:02 2024 -0400
[SPARK-50264][SQL][CONNECT] Add missing methods to DataStreamWriter
### What changes were proposed in this pull request?
We missed a couple of methods when we introduced the `DataStreamWriter`
interface. This PR adds them back.
### Why are the changes needed?
The `DataStreamWriter` interface must expose all user-facing methods.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #48796 from hvanhovell/SPARK-50264.
Authored-by: Herman van Hovell <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
---
.../apache/spark/sql/api/DataStreamWriter.scala | 37 ++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/sql/api/src/main/scala/org/apache/spark/sql/api/DataStreamWriter.scala b/sql/api/src/main/scala/org/apache/spark/sql/api/DataStreamWriter.scala
index 7762708e9520..f627eb3e167a 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/api/DataStreamWriter.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/api/DataStreamWriter.scala
@@ -93,6 +93,43 @@ abstract class DataStreamWriter[T] extends WriteConfigMethods[DataStreamWriter[T
*/
def queryName(queryName: String): this.type
+ /**
+ * Specifies the underlying output data source.
+ *
+ * @since 2.0.0
+ */
+ def format(source: String): this.type
+
+ /**
+ * Partitions the output by the given columns on the file system. If specified, the output is
+ * laid out on the file system similar to Hive's partitioning scheme. As an example, when we
+ * partition a dataset by year and then month, the directory layout would look like:
+ *
+ * <ul> <li> year=2016/month=01/</li> <li> year=2016/month=02/</li> </ul>
+ *
+ * Partitioning is one of the most widely used techniques to optimize physical data layout. It
+ * provides a coarse-grained index for skipping unnecessary data reads when queries have
+ * predicates on the partitioned columns. In order for partitioning to work well, the number of
+ * distinct values in each column should typically be less than tens of thousands.
+ *
+ * @since 2.0.0
+ */
+ @scala.annotation.varargs
+ def partitionBy(colNames: String*): this.type
+
+ /**
+ * Clusters the output by the given columns. If specified, the output is laid out such that
+ * records with similar values on the clustering column are grouped together in the same file.
+ *
+ * Clustering improves query efficiency by allowing queries with predicates on the clustering
+ * columns to skip unnecessary data. Unlike partitioning, clustering can be used on very high
+ * cardinality columns.
+ *
+ * @since 4.0.0
+ */
+ @scala.annotation.varargs
+ def clusterBy(colNames: String*): this.type
+
/**
* Sets the output of the streaming query to be processed using the provided writer object.
* object. See [[org.apache.spark.sql.ForeachWriter]] for more details on the lifecycle and
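
All three added methods return `this.type`, keeping the fluent builder style used by the rest of the writer API. A minimal, Spark-free sketch of that pattern (the `SketchWriter` class below is hypothetical and for illustration only; it is not the actual `DataStreamWriter` implementation):

```scala
// Hypothetical stand-in illustrating the fluent `this.type` builder pattern;
// not the real Spark DataStreamWriter.
class SketchWriter {
  private var fmt: String = ""
  private var partCols: Seq[String] = Nil
  private var clusterCols: Seq[String] = Nil

  // Mirrors `def format(source: String): this.type`
  def format(source: String): this.type = { fmt = source; this }

  // Mirrors `def partitionBy(colNames: String*): this.type`
  @scala.annotation.varargs
  def partitionBy(colNames: String*): this.type = { partCols = colNames; this }

  // Mirrors `def clusterBy(colNames: String*): this.type`
  @scala.annotation.varargs
  def clusterBy(colNames: String*): this.type = { clusterCols = colNames; this }

  def describe: String =
    s"format=$fmt partitionBy=${partCols.mkString(",")} clusterBy=${clusterCols.mkString(",")}"
}

object SketchDemo extends App {
  // Calls chain because each configuration method returns `this.type`.
  val w = new SketchWriter().format("parquet").partitionBy("year", "month").clusterBy("userId")
  println(w.describe) // prints "format=parquet partitionBy=year,month clusterBy=userId"
}
```

Returning `this.type` rather than the concrete class lets subclasses chain these calls without losing their refined type, which is why the abstract interface can declare them without bodies.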
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]