srowen closed pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming
URL: https://github.com/apache/spark/pull/22238
 
 
   

This is a pull request merged from a forked repository. Because GitHub hides the original diff on merge, it is reproduced below for the sake of provenance:

diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 73de1892977ac..8c3622c857240 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -2812,6 +2812,16 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f
 
 # Additional Information
 
+**Notes**
+
+- Several configurations are not modifiable after the query has run. To change them, discard the checkpoint and start a new query. These configurations include:
+  - `spark.sql.shuffle.partitions`
+    - This is due to the physical partitioning of state: state is partitioned by applying a hash function to the key, so the number of partitions for state must stay unchanged.
+    - If you want to run fewer tasks for stateful operations, `coalesce` helps avoid unnecessary repartitioning.
+      - After `coalesce`, the reduced number of tasks will be kept unless another shuffle happens.
+  - `spark.sql.streaming.stateStore.providerClass`: To read the previous state of the query properly, the class of the state store provider must stay unchanged.
+  - `spark.sql.streaming.multipleWatermarkPolicy`: Changing this would lead to inconsistent watermark values when the query contains multiple watermarks, so the policy must stay unchanged.
+
 **Further Reading**
 
 - See and run the
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index ef3ce98fd7add..ca100da9f019c 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -266,7 +266,9 @@ object SQLConf {
     .createWithDefault(Long.MaxValue)
 
   val SHUFFLE_PARTITIONS = buildConf("spark.sql.shuffle.partitions")
-    .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
+    .doc("The default number of partitions to use when shuffling data for joins or aggregations. " +
+      "Note: For structured streaming, this configuration cannot be changed between query " +
+      "restarts from the same checkpoint location.")
     .intConf
     .createWithDefault(200)
 
@@ -868,7 +870,9 @@ object SQLConf {
       .internal()
       .doc(
         "The class used to manage state data in stateful streaming queries. This class must " +
-          "be a subclass of StateStoreProvider, and must have a zero-arg constructor.")
+          "be a subclass of StateStoreProvider, and must have a zero-arg constructor. " +
+          "Note: For structured streaming, this configuration cannot be changed between query " +
+          "restarts from the same checkpoint location.")
       .stringConf
       .createWithDefault(
         "org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider")
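The hash-partitioned-state rationale in the docs change above can be illustrated outside Spark. Spark routes each state row to a partition by hashing its key modulo the partition count; the sketch below uses Python's CRC32 as a stand-in hash (Spark actually uses Murmur3) to show that changing the partition count re-routes keys away from the partitions holding their existing state. All names here are illustrative, not Spark APIs:

```python
# Sketch: why the state partition count must not change across restarts.
# A key's state lives in the partition its hash maps to; change the
# modulus and the same key may map somewhere else, so its old state
# is no longer found.
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a key to a partition id (stand-in hash)."""
    return zlib.crc32(key.encode()) % num_partitions

keys = ["user-1", "user-2", "user-3", "user-4"]

# First run of the query: 200 shuffle partitions, state stored per partition.
old = {k: partition_for(k, 200) for k in keys}

# Restart with a different setting: 100 partitions.
new = {k: partition_for(k, 100) for k in keys}

# Keys whose state would now be looked up in the wrong partition.
moved = [k for k in keys if old[k] != new[k]]
print(moved)
```

This is also why the docs recommend `coalesce` over lowering the setting: `coalesce` reduces the number of tasks without re-hashing rows into a different partition layout.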


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
