srowen commented on a change in pull request #22238: [SPARK-25245][DOCS][SS]
Explain regarding limiting modification on "spark.sql.shuffle.partitions" for
structured streaming
URL: https://github.com/apache/spark/pull/22238#discussion_r241964637
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -2812,6 +2812,19 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f
# Additional Information
+**Notes**
+
+- There are a couple of configurations that cannot be modified after the query has run. If you really want to change these configurations, you have to discard the checkpoint and start a new query.
+ - `spark.sql.shuffle.partitions`
+    - This is due to the physical partitioning of state: state is partitioned by applying a hash function to the key, hence the number of partitions for state should remain unchanged.
+    - If you want to run fewer tasks for stateful operations, `coalesce` helps avoid unnecessary repartitioning.
+      - e.g. `df.groupBy("time").count().coalesce(10)` reduces the number of tasks to 10, whereas `spark.sql.shuffle.partitions` may be bigger.
+      - After `coalesce`, the number of (reduced) tasks will be kept unless another shuffle happens.
+ - `spark.sql.streaming.stateStore.providerClass`
+    - To read previous state of the query properly, the class of state store provider should be unchanged.
Review comment:
To read _the_ previous state, etc. These also don't need to be sub-bullet points?
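The hash-partitioning constraint described in the quoted diff can be illustrated with a small stand-in. This is only a sketch: `state_partition` is an invented helper using CRC32, not Spark's actual Murmur3-based partitioner, but it shows why changing `spark.sql.shuffle.partitions` after a checkpoint exists would send keys to the wrong state partitions.

```python
import zlib


def state_partition(key: str, num_partitions: int) -> int:
    # Deterministic stand-in for Spark's hash partitioner: state for a key
    # lives in partition hash(key) % num_partitions.
    return zlib.crc32(key.encode()) % num_partitions


# Hypothetical grouping keys (e.g. window start times in a streaming aggregation).
keys = [f"window-{i}" for i in range(50)]

# Partition assignment with the default 200 shuffle partitions...
before = {k: state_partition(k, 200) for k in keys}
# ...and after restarting with spark.sql.shuffle.partitions=100.
after = {k: state_partition(k, 100) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys now map to a different state partition")
```

Any key in `moved` would be looked up in a partition whose state files belong to other keys, which is why the partition count must stay fixed for the life of the checkpoint.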
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]