[jira] [Updated] (SPARK-51596) Fix concurrent StateStoreProvider maintenance and closing

Livia Zhu (Jira) Mon, 24 Mar 2025 14:48:34 -0700


[ https://issues.apache.org/jira/browse/SPARK-51596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Livia Zhu updated SPARK-51596:
------------------------------
    Description: 
Currently, both the task thread and the maintenance thread can call unload() on 
a provider. This creates a race condition: the maintenance thread could be 
running maintenance on a provider while the task thread is closing it, leading 
to unexpected behavior.

We want to guarantee that a provider is never closed or closing while 
maintenance is running on it. The easiest way to do this is to move the unload 
operation into the maintenance thread. To keep unloading as soon as possible 
(rather than potentially waiting for the maintenance interval), as introduced by 
https://issues.apache.org/jira/browse/SPARK-33827, we should immediately 
trigger a maintenance thread to do the unload.
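
The idea in the paragraph above can be sketched roughly as follows. This is an 
illustrative Python sketch, not Spark's Scala implementation: Provider, 
ProviderRegistry, run_maintenance and request_unload are hypothetical names, and 
a single-worker pool stands in for the maintenance thread so that maintenance 
and unload can never overlap. Synchronization of the map itself is left out.

```python
from concurrent.futures import ThreadPoolExecutor


class Provider:
    """Hypothetical stand-in for a StateStoreProvider."""

    def __init__(self, name):
        self.name = name
        self.closed = False
        self.maintenance_runs = 0

    def do_maintenance(self):
        # Maintenance must never observe a closed provider.
        if self.closed:
            raise RuntimeError(f"maintenance on closed provider {self.name}")
        self.maintenance_runs += 1

    def close(self):
        self.closed = True


class ProviderRegistry:
    """Runs both maintenance and unload on one maintenance thread (assumption)."""

    def __init__(self):
        self.loaded = {}
        # One worker serializes maintenance and unload tasks.
        self.maintenance_pool = ThreadPoolExecutor(max_workers=1)

    def load(self, name):
        provider = Provider(name)
        self.loaded[name] = provider
        return provider

    def run_maintenance(self):
        # One maintenance pass; the periodic timer that would call this is omitted.
        def task():
            for provider in list(self.loaded.values()):
                provider.do_maintenance()
        return self.maintenance_pool.submit(task)

    def request_unload(self, name):
        # Called from a task thread: hand the close to the maintenance thread
        # right away instead of closing here or waiting for the next interval.
        def task():
            provider = self.loaded.pop(name, None)
            if provider is not None:
                provider.close()
            return provider
        return self.maintenance_pool.submit(task)

    def shutdown(self):
        self.maintenance_pool.shutdown(wait=True)
```

In this sketch the task thread only submits the unload request and gets a 
future back; the close itself happens on the maintenance thread, queued behind 
any maintenance pass that is already running.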

This has the added benefit that unloading other providers no longer blocks the 
task thread. To capitalize on this, unload() should not hold the 
loadedProviders lock the entire time (which would block other task threads). 
Instead, it should release the lock once it has removed the providers being 
unloaded from the map, and then close them without the lock held.
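
A minimal sketch of that locking pattern, again in Python and under assumed 
names (LoadedProviders, put, names and the should_unload predicate are 
hypothetical, not Spark APIs): the lock is held only while entries are removed 
from the map, and the removed providers are closed after it is released.

```python
import threading


class Provider:
    """Hypothetical stand-in for a StateStoreProvider."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        # In a real system closing may be slow (flushing, releasing resources).
        self.closed = True


class LoadedProviders:
    def __init__(self):
        self._lock = threading.Lock()
        self._providers = {}

    def put(self, name, provider):
        with self._lock:
            self._providers[name] = provider

    def names(self):
        with self._lock:
            return sorted(self._providers)

    def unload(self, should_unload):
        # Step 1: under the lock, only remove the matching entries from the map.
        with self._lock:
            to_close = [p for n, p in self._providers.items() if should_unload(n)]
            for p in to_close:
                del self._providers[p.name]
        # Step 2: the lock is released; close the removed providers without
        # blocking other threads that need the map.
        for p in to_close:
            p.close()
        return to_close
```

Because the providers are already gone from the map when close() runs, other 
threads can keep loading and looking up providers while the slower close work 
is in progress.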

  was:
Currently, both the task thread and maintenance thread can call unload() on a 
provider. This leads to a race condition where the maintenance could be 
conducting maintenance while the task thread is closing the provider, leading 
to unexpected behavior.

We want to guarantee that when maintenance is run, the provider is not 
closed/closing. The easiest way to do this is to move the unload operation into 
the maintenance thread. To continue unloading ASAP (rather than potentially 
waiting for the maintenance interval) as was done by 
https://issues.apache.org/jira/browse/SPARK-33827, we should immediately 
trigger a maintenance thread to do the unload.

This gives us an extra benefit that unloading other providers doesn't block the 
task thread. To capitalize on this, unload() should not hold the 
loadedProviders lock the entire time (which will block other task threads), but 
instead release it once it has deleted the unloading providers from the map and 
close the providers without the lock held.


> Fix concurrent StateStoreProvider maintenance and closing
> ---------------------------------------------------------
>
>                 Key: SPARK-51596
>                 URL: https://issues.apache.org/jira/browse/SPARK-51596
>             Project: Spark
>          Issue Type: Task
>          Components: Structured Streaming
>    Affects Versions: 4.0.0
>            Reporter: Livia Zhu
>            Priority: Major


