HeartSaVioR commented on a change in pull request #30789:
URL: https://github.com/apache/spark/pull/30789#discussion_r544731379
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -1693,20 +1693,23 @@ end-to-end exactly once per query. Ensuring end-to-end exactly once for the last
The stateful operations store states for events in state stores of executors.
State stores occupy resources such as memory and disk space to store the states.
So it is more efficient to keep a state store provider running in the same executor across different streaming batches.
-Changing the location of a state store provider requires loading from checkpointed states from HDFS in the new executor.
+Changing the location of a state store provider requires the extra overhead of loading checkpointed states. The overhead of loading state from a checkpoint depends
+on the external storage and the size of the state, which tends to hurt the latency of micro-batch runs. For some use cases such as processing very large state data,
+loading new state store providers from checkpointed states can be very time-consuming and inefficient.
The stateful operations in Structured Streaming queries rely on the preferred location feature of Spark's RDD to run the state store provider on the same executor.
-However, generally the preferred location is not a hard requirement and it is still possible that Spark schedules tasks to the executors other than the preferred ones.
-
-In this case, Spark will load state store providers from checkpointed states on HDFS to new executors. The state store providers run in the previous batch will not be unloaded immediately.
If in the next batch the corresponding state store provider is scheduled on this executor again, it could reuse the previous states and save the time of loading checkpointed states.
+
+However, generally the preferred location is not a hard requirement and it is still possible that Spark schedules tasks to executors other than the preferred ones.
+In this case, Spark will load state store providers from checkpointed states on new executors. The state store providers that ran in the previous batch will not be unloaded immediately.
Spark runs a maintenance task which checks and unloads the state store providers that are inactive on the executors.
-For some use cases such as processing very large state data, loading new state store providers from checkpointed states can be very time-consuming and inefficient.
By changing the Spark configurations related to task scheduling, for example `spark.locality.wait`, users can configure Spark how long to wait to launch a data-local task.
Review comment:
OFF TOPIC (for #30770): probably we may need to allow a different locality wait timeout for the state RDD than for the other RDDs.
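For context, today there is only a single set of locality-wait knobs and they apply to every task, including the tasks of the state RDD. Roughly (a sketch; the values are illustrative, not recommendations):

```
# conf/spark-defaults.conf -- illustrative values
# Global fallback wait before the scheduler gives up on a locality level;
# this also governs how long state store tasks wait for their preferred executor.
spark.locality.wait          3s
# Per-level overrides, all defaulting to spark.locality.wait:
spark.locality.wait.process  3s
spark.locality.wait.node     3s
spark.locality.wait.rack     3s
```

A state-RDD-specific timeout would let users keep a short wait for ordinary tasks while waiting longer for stateful tasks, where losing locality means reloading the whole state from the checkpoint.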