Github user HeartSaVioR commented on a diff in the pull request:
https://github.com/apache/spark/pull/22238#discussion_r213129120
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output
Sinks](#output-sinks) sections f
# Additional Information
+**Gotchas**
+
+- For structured streaming, modifying "spark.sql.shuffle.partitions" is
restricted once you run the query.
+ - This is because state is partitioned via key, hence number of
partitions for state should be unchanged.
+ - If you want to run less tasks for stateful operations, `coalesce`
would help with avoiding unnecessary repartitioning. Please note that it will
also affect downstream operators.
--- End diff --
It just means that the number of partitions in stateful operations' output
will be same as parameter for `coalesce`, and the number of partitions will be
kept unless another shuffle happens. It is implicitly same as
`spark.sql.shuffle.partitions`, which default value is 200.
I'll add the code, but not sure we need to have the code per language like
Scala / Java / Python tabs since they will be same.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]