This is an automated email from the ASF dual-hosted git repository.
hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 6fd77b1e7f5b [SPARK-50264][SQL][CONNECT] Add missing methods to DataStreamWriter
6fd77b1e7f5b is described below
commit 6fd77b1e7f5ba45e13069b747d206043d25e62e0
Author: Herman van Hovell <[email protected]>
AuthorDate: Sun Nov 10 18:00:02 2024 -0400
[SPARK-50264][SQL][CONNECT] Add missing methods to DataStreamWriter
### What changes were proposed in this pull request?
We missed a couple of methods when we introduced the `DataStreamWriter`
interface. This PR adds them back.
### Why are the changes needed?
The `DataStreamWriter` interface must expose all user-facing methods.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #48796 from hvanhovell/SPARK-50264.
Authored-by: Herman van Hovell <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
---
.../apache/spark/sql/api/DataStreamWriter.scala | 37 ++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/sql/api/src/main/scala/org/apache/spark/sql/api/DataStreamWriter.scala b/sql/api/src/main/scala/org/apache/spark/sql/api/DataStreamWriter.scala
index 7762708e9520..f627eb3e167a 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/api/DataStreamWriter.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/api/DataStreamWriter.scala
@@ -93,6 +93,43 @@ abstract class DataStreamWriter[T] extends WriteConfigMethods[DataStreamWriter[T
*/
def queryName(queryName: String): this.type
+ /**
+ * Specifies the underlying output data source.
+ *
+ * @since 2.0.0
+ */
+ def format(source: String): this.type
+
+ /**
+ * Partitions the output by the given columns on the file system. If specified, the output is
+ * laid out on the file system similar to Hive's partitioning scheme. As an example, when we
+ * partition a dataset by year and then month, the directory layout would look like:
+ *
+ * <ul> <li> year=2016/month=01/</li> <li> year=2016/month=02/</li> </ul>
+ *
+ * Partitioning is one of the most widely used techniques to optimize physical data layout. It
+ * provides a coarse-grained index for skipping unnecessary data reads when queries have
+ * predicates on the partitioned columns. In order for partitioning to work well, the number of
+ * distinct values in each column should typically be less than tens of thousands.
+ *
+ * @since 2.0.0
+ */
+ @scala.annotation.varargs
+ def partitionBy(colNames: String*): this.type
+
+ /**
+ * Clusters the output by the given columns. If specified, the output is laid out such that
+ * records with similar values on the clustering column are grouped together in the same file.
+ *
+ * Clustering improves query efficiency by allowing queries with predicates on the clustering
+ * columns to skip unnecessary data. Unlike partitioning, clustering can be used on very high
+ * cardinality columns.
+ *
+ * @since 4.0.0
+ */
+ @scala.annotation.varargs
+ def clusterBy(colNames: String*): this.type
+
/**
* Sets the output of the streaming query to be processed using the provided writer object.
* object. See [[org.apache.spark.sql.ForeachWriter]] for more details on the lifecycle and
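
All three added methods return `this.type`, keeping the fluent builder style used by the rest of the writer API. A minimal, Spark-free sketch of that pattern (the `SketchWriter` class below is hypothetical and for illustration only; it is not the actual `DataStreamWriter` implementation):

```scala
// Hypothetical stand-in illustrating the fluent `this.type` builder pattern;
// not the real Spark DataStreamWriter.
class SketchWriter {
  private var fmt: String = ""
  private var partCols: Seq[String] = Nil
  private var clusterCols: Seq[String] = Nil

  // Mirrors `def format(source: String): this.type`
  def format(source: String): this.type = { fmt = source; this }

  // Mirrors `def partitionBy(colNames: String*): this.type`
  @scala.annotation.varargs
  def partitionBy(colNames: String*): this.type = { partCols = colNames; this }

  // Mirrors `def clusterBy(colNames: String*): this.type`
  @scala.annotation.varargs
  def clusterBy(colNames: String*): this.type = { clusterCols = colNames; this }

  def describe: String =
    s"format=$fmt partitionBy=${partCols.mkString(",")} clusterBy=${clusterCols.mkString(",")}"
}

object SketchDemo extends App {
  // Calls chain because each configuration method returns `this.type`.
  val w = new SketchWriter().format("parquet").partitionBy("year", "month").clusterBy("userId")
  println(w.describe) // prints "format=parquet partitionBy=year,month clusterBy=userId"
}
```

Returning `this.type` rather than the concrete class lets subclasses chain these calls without losing their refined type, which is why the abstract interface can declare them without bodies.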
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]