HeartSaVioR commented on a change in pull request #30789:
URL: https://github.com/apache/spark/pull/30789#discussion_r544731379
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -1693,20 +1693,23 @@ end-to-end exactly once per query. Ensuring end-to-end exactly once for the last
The stateful operations store states for events in state stores of executors.
State stores occupy resources such as memory and disk space to store the states.
So it is more efficient to keep a state store provider running in the same executor across different streaming batches.
-Changing the location of a state store provider requires loading from checkpointed states from HDFS in the new executor.
+Changing the location of a state store provider requires the extra overhead of loading checkpointed states. The overhead of loading state from a checkpoint depends
+on the external storage and the size of the state, which tends to hurt the latency of micro-batch runs. For some use cases such as processing very large state data,
+loading new state store providers from checkpointed states can be very time-consuming and inefficient.
The stateful operations in Structured Streaming queries rely on the preferred location feature of Spark's RDD to run the state store provider on the same executor.
-However, generally the preferred location is not a hard requirement and it is still possible that Spark schedules tasks to the executors other than the preferred ones.
-
-In this case, Spark will load state store providers from checkpointed states on HDFS to new executors. The state store providers run in the previous batch will not be unloaded immediately.
If in the next batch the corresponding state store provider is scheduled on this executor again, it could reuse the previous states and save the time of loading checkpointed states.
+
+However, generally the preferred location is not a hard requirement and it is still possible that Spark schedules tasks to executors other than the preferred ones.
+In this case, Spark will load state store providers from checkpointed states on new executors. The state store providers that ran in the previous batch will not be unloaded immediately.
Spark runs a maintenance task which checks and unloads the state store providers that are inactive on the executors.
-For some use cases such as processing very large state data, loading new state store providers from checkpointed states can be very time-consuming and inefficient.
By changing the Spark configurations related to task scheduling, for example `spark.locality.wait`, users can configure Spark how long to wait to launch a data-local task.
Review comment:
OFF TOPIC (for #30770): probably we may need to allow a different locality wait timeout for the state RDD than for the other RDDs.
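For context, today there is only a single set of locality-wait knobs and they apply to every task, including the tasks of the state RDD. Roughly (a sketch; the values are illustrative, not recommendations):

```
# conf/spark-defaults.conf -- illustrative values
# Global fallback wait before the scheduler gives up on a locality level;
# this also governs how long state store tasks wait for their preferred executor.
spark.locality.wait          3s
# Per-level overrides, all defaulting to spark.locality.wait:
spark.locality.wait.process  3s
spark.locality.wait.node     3s
spark.locality.wait.rack     3s
```

A state-RDD-specific timeout would let users keep a short wait for ordinary tasks while waiting longer for stateful tasks, where losing locality means reloading the whole state from the checkpoint.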