srowen commented on a change in pull request #22238: [SPARK-25245][DOCS][SS]
Explain regarding limiting modification on "spark.sql.shuffle.partitions" for
structured streaming
URL: https://github.com/apache/spark/pull/22238#discussion_r241964637
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -2812,6 +2812,19 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f
# Additional Information
+**Notes**
+
+- There are a couple of configurations that cannot be modified after the query has run. If you really want to change these configurations, you have to discard the checkpoint and start a new query.
+ - `spark.sql.shuffle.partitions`
+    - This is due to the physical partitioning of state: state is partitioned by applying a hash function to the key, hence the number of partitions for state should remain unchanged.
+    - If you want to run fewer tasks for stateful operations, `coalesce` helps avoid unnecessary repartitioning.
+      - e.g. `df.groupBy("time").count().coalesce(10)` reduces the number of tasks to 10, whereas `spark.sql.shuffle.partitions` may be bigger.
+      - After `coalesce`, the number of (reduced) tasks will be kept unless another shuffle happens.
+ - `spark.sql.streaming.stateStore.providerClass`
+    - To read previous state of the query properly, the class of state store provider should be unchanged.
Review comment:
To read _the_ previous state, etc. These also don't need to be sub-bullet points?
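The hash-partitioning constraint described in the quoted diff can be illustrated with a small stand-in. This is only a sketch: `state_partition` is an invented helper using CRC32, not Spark's actual Murmur3-based partitioner, but it shows why changing `spark.sql.shuffle.partitions` after a checkpoint exists would send keys to the wrong state partitions.

```python
import zlib


def state_partition(key: str, num_partitions: int) -> int:
    # Deterministic stand-in for Spark's hash partitioner: state for a key
    # lives in partition hash(key) % num_partitions.
    return zlib.crc32(key.encode()) % num_partitions


# Hypothetical grouping keys (e.g. window start times in a streaming aggregation).
keys = [f"window-{i}" for i in range(50)]

# Partition assignment with the default 200 shuffle partitions...
before = {k: state_partition(k, 200) for k in keys}
# ...and after restarting with spark.sql.shuffle.partitions=100.
after = {k: state_partition(k, 100) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys now map to a different state partition")
```

Any key in `moved` would be looked up in a partition whose state files belong to other keys, which is why the partition count must stay fixed for the life of the checkpoint.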
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]