apoorvmittal10 commented on PR #19759: URL: https://github.com/apache/kafka/pull/19759#issuecomment-2899443174
> @apoorvmittal10 : Thanks for the explanation. It seems that we found a case that an acquired sharePartition lock was released more than once. I am still not sure how it happened. forceComplete() is supposed to only return true once and the lock is only released when forceComplete() returns true. How is the acquired lock released a second time? Prior to this PR there is no lock in [forceComplete](https://github.com/apache/kafka/blob/e88c10d5952c464a0f597b295deaa51e4c9fa978/server-common/src/main/java/org/apache/kafka/server/purgatory/DelayedOperation.java#L73) but a way to make sure that onComplete is not executed twice. However, consider that thread 1 has started executing [safeTryComplete](https://github.com/apache/kafka/blob/e88c10d5952c464a0f597b295deaa51e4c9fa978/server-common/src/main/java/org/apache/kafka/server/purgatory/DelayedOperation.java#L133) which attains lock and checks that if the operation has marked completed, if not then tryComplete is executed. Now thread 1 is executing tryComplete, then thread 2 on timeout executes [forceComplete](https://github.com/apache/kafka/blob/e88c10d5952c464a0f597b295deaa51e4c9fa978/server-common/src/main/java/org/apache/kafka/server/purgatory/DelayedOperation.java#L148) which marks completed as true and waits [here](https://github.com/apache/kafka/blob/e88c10d5952c464a0f59 7b295deaa51e4c9fa978/core/src/main/java/kafka/server/share/DelayedShareFetch.java#L199) for the operation lock. Now thread 1, in tryComplete, acquires some share partitions but as [completed](https://github.com/apache/kafka/blob/e88c10d5952c464a0f597b295deaa51e4c9fa978/core/src/main/java/kafka/server/share/DelayedShareFetch.java#L899) is already marked true then the share partitions are released and so the lock for delayed operation. Thread 2 now starts executing `onComplete` but as [partitionsAcquired](https://github.com/apache/kafka/blob/e88c10d5952c464a0f597b295deaa51e4c9fa978/core/src/main/java/kafka/server/share/DelayedShareFetch.java#L106) is at instance level hence `onComplete` tries to execute further with those share partitions which are already released. Once completed then onCompete releases those share partitions again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org