[ 
https://issues.apache.org/jira/browse/HDDS-7738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duong updated HDDS-7738:
------------------------
    Description: 
This is similar to HDDS-5843, but in a different scenario.

An Ozone customer encountered this issue after a container (c1) was allocated 
with a newly created pipeline (p1). The chain of events is as follows:
 # SCM processes the pipeline creation transaction for *p1* => *p1* is *created*.
 # SCM receives a request to close *p1* from a datanode (see the previous 
comment)
=> *p1* is *closed*.
=> SCM also tries to find and close the relevant containers. At this point, 
container *c1* doesn't exist yet, so it *can't be closed*.
 # SCM processes the container *c1* allocation transaction => it fails because 
*p1* is already *closed*.
=> SCM terminates, and neither transaction #1 nor #3 is committed (as Ratis 
commits transactions in chunks).

Because the transactions are not committed, whenever SCM restarts, it goes 
through the same steps #1 and #3 and terminates again.

Solution: SCM should not terminate when adding a container to a closed 
pipeline. The fix is similar to HDDS-5843.
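
To make the proposal concrete, here is a minimal sketch of the sequence above in 
a toy model. It is not the Ozone implementation: the class, the 
tolerateClosedPipeline flag, and all method names are hypothetical, and 
"tolerating" is assumed here to mean skipping the association and reporting it 
to the caller. What the actual fix does with the container is not stated in 
this report.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Toy model only. Every name in this class is hypothetical and is not taken
 * from the Ozone code base.
 */
public class ClosedPipelineSketch {
    enum PipelineState { OPEN, CLOSED }

    /** Stand-in for SCM terminating itself when a transaction fails to apply. */
    static class ScmTerminatedException extends RuntimeException {
        ScmTerminatedException(String message) { super(message); }
    }

    final Map<String, PipelineState> pipelines = new HashMap<>();
    final Map<String, Set<String>> containersByPipeline = new HashMap<>();
    final boolean tolerateClosedPipeline;

    ClosedPipelineSketch(boolean tolerateClosedPipeline) {
        this.tolerateClosedPipeline = tolerateClosedPipeline;
    }

    /** Step 1: the pipeline is created in the OPEN state with no containers. */
    void createPipeline(String pipelineId) {
        pipelines.put(pipelineId, PipelineState.OPEN);
        containersByPipeline.put(pipelineId, new HashSet<>());
    }

    /**
     * Step 2: the pipeline is closed. Returns how many containers were known
     * for it at that moment; a container allocated later is not covered.
     */
    int closePipeline(String pipelineId) {
        pipelines.put(pipelineId, PipelineState.CLOSED);
        return containersByPipeline
            .getOrDefault(pipelineId, Collections.emptySet()).size();
    }

    /**
     * Step 3: attach a container to a pipeline. Returns true if attached.
     * With the flag off, a closed pipeline is treated as fatal (the reported
     * behavior). With the flag on, the association is skipped and false is
     * returned (an assumed reading of "should not terminate").
     */
    boolean addContainer(String pipelineId, String containerId) {
        if (pipelines.get(pipelineId) == PipelineState.CLOSED) {
            if (!tolerateClosedPipeline) {
                throw new ScmTerminatedException(
                    "Cannot add " + containerId + " to closed pipeline " + pipelineId);
            }
            return false;
        }
        containersByPipeline
            .computeIfAbsent(pipelineId, k -> new HashSet<>()).add(containerId);
        return true;
    }

    public static void main(String[] args) {
        for (boolean tolerate : new boolean[] {false, true}) {
            ClosedPipelineSketch scm = new ClosedPipelineSketch(tolerate);
            scm.createPipeline("p1");
            scm.closePipeline("p1");
            try {
                boolean attached = scm.addContainer("p1", "c1");
                System.out.println("tolerate=" + tolerate + " attached=" + attached);
            } catch (ScmTerminatedException e) {
                System.out.println("tolerate=" + tolerate + " terminated: " + e.getMessage());
            }
        }
    }
}
```

In this model, the strict variant throws at step 3, while the tolerant variant 
returns normally, so a replay of steps #1 and #3 can complete instead of 
failing at the same point on every restart.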

> SCM terminates when adding container to a closed pipeline
> ---------------------------------------------------------
>
>                 Key: HDDS-7738
>                 URL: https://issues.apache.org/jira/browse/HDDS-7738
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Duong
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
