Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22238#discussion_r213181165
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,19 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f
# Additional Information
+**Notes**
+
+- There are a couple of configurations that cannot be modified once you run the query. If you really want to change these configurations, you have to discard the checkpoint and start a new query.
+ - `spark.sql.shuffle.partitions`
+ - This is due to the physical partitioning of state: state is partitioned by applying a hash function to the key, so the number of partitions for state must remain unchanged.
+ - If you want to run fewer tasks for stateful operations, `coalesce` would help avoid unnecessary repartitioning.
+ - e.g. `df.groupBy("time").count().coalesce(10)` reduces the number of tasks to 10, even though `spark.sql.shuffle.partitions` may be larger.
+ - After `coalesce`, the reduced number of tasks will be kept unless another shuffle happens.
+ - `spark.sql.streaming.stateStore.providerClass`
--- End diff ---
Ah, okay, so there are more instances to describe here. If so, I'm okay with it.
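The rationale in the quoted doc change, that state is hash-partitioned by key and therefore the partition count must stay fixed across restarts, can be illustrated with a minimal pure-Python sketch. This uses CRC32 as a stand-in hash and is an assumption for illustration only, not Spark's actual `HashPartitioner` code:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Route a key to a state partition: non-negative hash of the key,
    # modulo the partition count (illustrative stand-in for a hash partitioner).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"user-{i}" for i in range(20)]

# Key -> partition mapping under the original setting (e.g. 200 partitions)...
old_layout = {k: partition_for(k, 200) for k in keys}
# ...and under a changed setting after a restart.
new_layout = {k: partition_for(k, 100) for k in keys}

# Any key whose partition differs would look for its checkpointed state in
# the wrong place, which is why the setting must not change across restarts.
moved = [k for k in keys if old_layout[k] != new_layout[k]]
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

The same argument applies to `spark.sql.streaming.stateStore.providerClass`: state files written by one provider's layout cannot be read back by another, so the checkpoint would have to be discarded.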
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]