anishshri-db commented on code in PR #44542:
URL: https://github.com/apache/spark/pull/44542#discussion_r1438964350
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala:
##########
@@ -434,22 +434,26 @@ case class StateStoreRestoreExec(
numColsPrefixKey = 0,
session.sessionState,
Some(session.streams.stateStoreCoordinator)) { case (store, iter) =>
- val hasInput = iter.hasNext
- if (!hasInput && keyExpressions.isEmpty) {
- // If our `keyExpressions` are empty, we're getting a global aggregation. In that case
- // the `HashAggregateExec` will output a 0 value for the partial merge. We need to
- // restore the value, so that we don't overwrite our state with a 0 value, but rather
- // merge the 0 with existing state.
- store.iterator().map(_.value)
- } else {
- iter.flatMap { row =>
- val key = stateManager.getKey(row.asInstanceOf[UnsafeRow])
- val restoredRow = stateManager.get(store, key)
- val outputRows = Option(restoredRow).toSeq :+ row
- numOutputRows += outputRows.size
- outputRows
- }
+ val hasInput = iter.hasNext
+ val result = if (!hasInput && keyExpressions.isEmpty) {
+ // If our `keyExpressions` are empty, we're getting a global aggregation. In that case
+ // the `HashAggregateExec` will output a 0 value for the partial merge. We need to
+ // restore the value, so that we don't overwrite our state with a 0 value, but rather
+ // merge the 0 with existing state.
+ store.iterator().map(_.value)
+ } else {
+ iter.flatMap { row =>
+ val key = stateManager.getKey(row.asInstanceOf[UnsafeRow])
+ val restoredRow = stateManager.get(store, key)
+ val outputRows = Option(restoredRow).toSeq :+ row
+ numOutputRows += outputRows.size
+ outputRows
}
+ }
+ // SPARK-46547 - Release any locks/resources if required, to prevent
+ // deadlocks with the maintenance thread.
+ store.abort()
Review Comment:
I believe this issue has always existed, although in the common case we would
rarely hit it. The restore operator is followed by the save operator, which
opens the DB instance in read-write mode, and at the end of the save operator
we always release the instance lock. For `ReadStateStore` invocations, though,
there is no such release (we only abort on task failure). So the interleaving
of the maintenance thread error case (also likely rare) with the execution of
StateStoreRDD for the `save` operator causes this deadlock.
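
To make the locking pattern concrete, here is a minimal, self-contained sketch
of a read-only handle that holds an instance lock until `abort()` is called,
with the release tied to the point where a lazy iterator has been fully
consumed. Everything in it (`ToyProvider`, `ReadHandle`, `withCompletion`,
`tryMaintenance`) is a hypothetical stand-in written for illustration. It is
not Spark's `StateStore` API and not necessarily how this PR resolves the
issue.

```scala
import java.util.concurrent.Semaphore

object ReadStoreReleaseSketch {
  // Hypothetical stand-in for a state store provider guarded by one instance lock.
  final class ToyProvider(data: Map[String, Int]) {
    private val instanceLock = new Semaphore(1)

    // Hypothetical read-only handle: it keeps the instance lock until abort().
    final class ReadHandle {
      private var released = false
      def iterator: Iterator[(String, Int)] = data.iterator
      def abort(): Unit = if (!released) {
        released = true
        instanceLock.release()
      }
    }

    // Opening a handle takes the lock, mimicking a read that pins the instance.
    def openForRead(): ReadHandle = {
      instanceLock.acquire()
      new ReadHandle
    }

    // What a maintenance thread would attempt: succeeds only if the lock is free.
    def tryMaintenance(): Boolean =
      if (instanceLock.tryAcquire()) { instanceLock.release(); true } else false
  }

  // Wraps an iterator so that `onComplete` runs exactly once, the first time
  // hasNext reports that the underlying iterator is exhausted.
  def withCompletion[A](underlying: Iterator[A])(onComplete: => Unit): Iterator[A] =
    new Iterator[A] {
      private var done = false
      def hasNext: Boolean = {
        val more = underlying.hasNext
        if (!more && !done) { done = true; onComplete }
        more
      }
      def next(): A = underlying.next()
    }

  def main(args: Array[String]): Unit = {
    val provider = new ToyProvider(Map("a" -> 1, "b" -> 2))
    val handle = provider.openForRead()
    val rows = withCompletion(handle.iterator)(handle.abort())

    println(provider.tryMaintenance()) // false: the read handle still holds the lock
    println(rows.map(_._2).sum)        // consuming the iterator triggers abort()
    println(provider.tryMaintenance()) // true: the lock was released on completion
  }
}
```

In this toy model, a handle that is never aborted leaves `tryMaintenance()`
failing indefinitely, which is the analogue of the maintenance thread being
blocked. Releasing on completion, rather than only on task failure, closes that
gap without releasing before the rows have been read.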