srowen closed pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding
limiting modification on "spark.sql.shuffle.partitions" for structured streaming
URL: https://github.com/apache/spark/pull/22238
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 73de1892977ac..8c3622c857240 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -2812,6 +2812,16 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f
# Additional Information
+**Notes**
+
+- Several configurations are not modifiable after the query has run. To change them, discard the checkpoint and start a new query. These configurations include:
+  - `spark.sql.shuffle.partitions`
+    - This is due to the physical partitioning of state: state is partitioned by applying a hash function to the key, hence the number of partitions for state should be unchanged.
+    - If you want to run fewer tasks for stateful operations, `coalesce` would help with avoiding unnecessary repartitioning.
+      - After `coalesce`, the number of (reduced) tasks will be kept unless another shuffle happens.
+  - `spark.sql.streaming.stateStore.providerClass`: To read the previous state of the query properly, the class of the state store provider should be unchanged.
+  - `spark.sql.streaming.multipleWatermarkPolicy`: Modifying this would lead to inconsistent watermark values when the query contains multiple watermarks, hence the policy should be unchanged.
+
**Further Reading**
- See and run the
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index ef3ce98fd7add..ca100da9f019c 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -266,7 +266,9 @@ object SQLConf {
.createWithDefault(Long.MaxValue)
val SHUFFLE_PARTITIONS = buildConf("spark.sql.shuffle.partitions")
-    .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
+    .doc("The default number of partitions to use when shuffling data for joins or aggregations. " +
+      "Note: For structured streaming, this configuration cannot be changed between query " +
+      "restarts from the same checkpoint location.")
.intConf
.createWithDefault(200)
@@ -868,7 +870,9 @@ object SQLConf {
.internal()
.doc(
       "The class used to manage state data in stateful streaming queries. This class must " +
-      "be a subclass of StateStoreProvider, and must have a zero-arg constructor.")
+      "be a subclass of StateStoreProvider, and must have a zero-arg constructor. " +
+      "Note: For structured streaming, this configuration cannot be changed between query " +
+      "restarts from the same checkpoint location.")
.stringConf
.createWithDefault(
"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider")
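The reason `spark.sql.shuffle.partitions` must stay fixed across restarts can be sketched outside of Spark. The snippet below is a simplified, hypothetical illustration in plain Python: it uses a stand-in hash (Spark actually uses Murmur3 on the grouping key), but shows the core issue — state rows are routed by `hash(key) % numPartitions`, so changing the partition count would send existing keys to different partitions than the ones their checkpointed state lives in.

```python
def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Spark uses Murmur3 on the grouping key.
    h = sum(ord(c) for c in key)
    return h % num_partitions

keys = ["user-1", "user-2", "user-3"]

# First run: spark.sql.shuffle.partitions = 200 (the default).
before = {k: partition_for(k, 200) for k in keys}

# Hypothetical restart from the same checkpoint with 100 partitions.
after = {k: partition_for(k, 100) for k in keys}

# Keys whose partition assignment changed: the restarted query would look
# for their state in the wrong partition's state store.
moved = [k for k in keys if before[k] != after[k]]
print(moved)
```

This is also why `coalesce` is safe where repartitioning is not: it merges existing partitions without rehashing keys, so the key-to-state mapping is preserved.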
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]