Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22238#discussion_r213181165
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,19 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f
# Additional Information
+**Notes**
+
+- There are a couple of configurations that cannot be modified once you run the query. If you really want to change these configurations, you have to discard the checkpoint and start a new query.
+ - `spark.sql.shuffle.partitions`
+ - This is due to the physical partitioning of state: state is partitioned by applying a hash function to the key, so the number of partitions for state must remain unchanged.
+ - If you want to run fewer tasks for stateful operations, `coalesce` would help avoid unnecessary repartitioning.
+ - e.g. `df.groupBy("time").count().coalesce(10)` reduces the number of tasks to 10, even though `spark.sql.shuffle.partitions` may be larger.
+ - After `coalesce`, the reduced number of tasks will be kept unless another shuffle happens.
+ - `spark.sql.streaming.stateStore.providerClass`
--- End diff ---
Ah, okay, so there are more instances to describe here. If so, I'm okay with it.
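The rationale in the quoted doc change, that state is hash-partitioned by key and therefore the partition count must stay fixed across restarts, can be illustrated with a minimal pure-Python sketch. This uses CRC32 as a stand-in hash and is an assumption for illustration only, not Spark's actual `HashPartitioner` code:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Route a key to a state partition: non-negative hash of the key,
    # modulo the partition count (illustrative stand-in for a hash partitioner).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"user-{i}" for i in range(20)]

# Key -> partition mapping under the original setting (e.g. 200 partitions)...
old_layout = {k: partition_for(k, 200) for k in keys}
# ...and under a changed setting after a restart.
new_layout = {k: partition_for(k, 100) for k in keys}

# Any key whose partition differs would look for its checkpointed state in
# the wrong place, which is why the setting must not change across restarts.
moved = [k for k in keys if old_layout[k] != new_layout[k]]
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

The same argument applies to `spark.sql.streaming.stateStore.providerClass`: state files written by one provider's layout cannot be read back by another, so the checkpoint would have to be discarded.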
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]