[
https://issues.apache.org/jira/browse/HDDS-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808012#comment-17808012
]
Hemant Kumar edited comment on HDDS-9486 at 1/18/24 5:47 AM:
-------------------------------------------------------------
I looked at this, and there is a deadlock between checkpoint creation for
bootstrapping and RocksDBCheckpointDiffer#pruneSstFiles.
Bootstrapping takes BootstrapStateHandler#lock before creating the checkpoint
and then takes the lock on the RocksDBCheckpointDiffer instance to unpause
the compaction thread(s). On the other hand,
RocksDBCheckpointDiffer#pruneSstFiles is a synchronized method, so it first
takes the lock on the RocksDBCheckpointDiffer instance and then takes
BootstrapStateHandler#lock before removing any files. The two paths acquire
the same two locks in opposite order, so each can end up waiting for the lock
the other holds.
Looking at this more, I don't think we need this synchronized block:
https://github.com/apache/ozone/pull/5104/files#diff-4e8bcca4269db3fa926667c07d733f58628b13b417bbd76d06c1683edbbd9750R227
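To illustrate the lock-order inversion described above, here is a minimal,
self-contained sketch. It is not the Ozone code: LockOrderSketch, prunerStalls
and the local variable names are hypothetical, a plain Semaphore stands in for
BootstrapStateHandler#lock (the thread dump shows that lock parking in
Semaphore.acquire), and a plain Object stands in for the
RocksDBCheckpointDiffer instance. The pruning side uses a timed tryAcquire
only so that the sketch terminates; with a blocking acquire() it would hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class LockOrderSketch {
  /**
   * Returns true if the pruning thread fails to get the bootstrap lock.
   * prunerTakesMonitor = true models a synchronized pruneSstFiles;
   * false models the same method without the synchronized keyword.
   */
  static boolean prunerStalls(boolean prunerTakesMonitor) {
    Semaphore bootstrapLock = new Semaphore(1); // stand-in for BootstrapStateHandler#lock
    Object differ = new Object();               // stand-in for the differ instance monitor
    CountDownLatch bootstrapLocked = new CountDownLatch(1);
    CountDownLatch prunerReady = new CountDownLatch(1);
    boolean[] stalled = new boolean[1];

    // Pruning side, second step: ask for the bootstrap lock.
    Runnable takeBootstrapLock = () -> {
      prunerReady.countDown();
      try {
        // Timed attempt so the sketch ends; a plain acquire() would block here.
        if (bootstrapLock.tryAcquire(500, TimeUnit.MILLISECONDS)) {
          bootstrapLock.release();
        } else {
          stalled[0] = true;
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };

    Thread pruner = new Thread(() -> {
      try {
        bootstrapLocked.await(); // start only once the bootstrap lock is held
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      if (prunerTakesMonitor) {
        synchronized (differ) { // order: differ monitor, then bootstrap lock
          takeBootstrapLock.run();
        }
      } else {
        takeBootstrapLock.run(); // bootstrap lock only
      }
    }, "pruner");
    pruner.start();

    try {
      bootstrapLock.acquire(); // order: bootstrap lock, then differ monitor
      try {
        bootstrapLocked.countDown();
        prunerReady.await();
        synchronized (differ) {
          // Bootstrap-side work that needs the differ monitor would go here.
        }
      } finally {
        bootstrapLock.release();
      }
      pruner.join(); // join makes the pruner's write to stalled[0] visible
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return stalled[0];
  }

  public static void main(String[] args) {
    System.out.println("synchronized pruner stalls: " + prunerStalls(true));
    System.out.println("unsynchronized pruner stalls: " + prunerStalls(false));
  }
}
```

In this sketch prunerStalls(true) returns true, because the bootstrap side
holds the semaphore while blocked on the monitor that the pruner owns, and
prunerStalls(false) returns false. That is the reason dropping the
synchronized block removes the cycle; acquiring both locks in the same order
on both paths would be another way to break it.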
was (Author: JIRAUSER297350):
I looked at this, and there is a deadlock between checkpoint creation for
bootstrapping and RocksDBCheckpointDiffer#pruneSstFiles.
Bootstrapping takes BootstrapStateHandler#lock before creating the checkpoint
and then takes the lock on the RocksDBCheckpointDiffer instance to create the
checkpoint. On the other hand, RocksDBCheckpointDiffer#pruneSstFiles is a
synchronized method, so it first takes the lock on the
RocksDBCheckpointDiffer instance and then takes BootstrapStateHandler#lock
before removing any files.
Looking at this more, I don't think we need this synchronized block:
https://github.com/apache/ozone/pull/5104/files#diff-4e8bcca4269db3fa926667c07d733f58628b13b417bbd76d06c1683edbbd9750R227
> Intermittent fork timeout in TestSnapshotBackgroundServices
> -----------------------------------------------------------
>
> Key: HDDS-9486
> URL: https://issues.apache.org/jira/browse/HDDS-9486
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: test
> Affects Versions: 1.4.0
> Reporter: Attila Doroszlai
> Assignee: Hemant Kumar
> Priority: Major
> Attachments: 2023-09-07T11-48-29_820-jvmRun1.dump,
> 2023-09-14T11-32-20_981-jvmRun1.dump
>
>
> Surefire fork for {{TestSnapshotBackgroundServices}} intermittently times out.
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2023/09/07/25178/it-om/output.log
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2023/09/14/25374/it-om/output.log
> CC [~hemantk], [~mladjangadzic]
> {code}
> "CompactionDagPruningService"
> java.lang.Thread.State: WAITING
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
> at org.apache.hadoop.ozone.lock.BootstrapStateHandler$Lock.lock(BootstrapStateHandler.java:31)
> at org.apache.ozone.rocksdiff.RocksDBCheckpointDiffer.pruneSstFiles(RocksDBCheckpointDiffer.java:1506)
> at org.apache.ozone.rocksdiff.RocksDBCheckpointDiffer$$Lambda$573/124020389.run(Unknown Source)
> "qtp555959536-13964"
> java.lang.Thread.State: BLOCKED
> at org.apache.hadoop.ozone.om.OMDBCheckpointServlet.getCheckpoint(OMDBCheckpointServlet.java:255)
> at org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:200)
> at org.apache.hadoop.hdds.utils.DBCheckpointServlet.doPost(DBCheckpointServlet.java:321)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:523)
> "main"
> java.lang.Thread.State: BLOCKED
> at org.apache.ozone.rocksdiff.RocksDBCheckpointDiffer.close(RocksDBCheckpointDiffer.java:340)
> at org.apache.hadoop.hdds.utils.IOUtils.close(IOUtils.java:78)
> at org.apache.hadoop.hdds.utils.IOUtils.close(IOUtils.java:64)
> at org.apache.hadoop.hdds.utils.IOUtils.closeQuietly(IOUtils.java:92)
> at org.apache.ozone.rocksdiff.RocksDBCheckpointDiffer$RocksDBCheckpointDifferHolder.invalidateCacheEntry(RocksDBCheckpointDiffer.java:1591)
> at org.apache.hadoop.hdds.utils.db.RDBStore.close(RDBStore.java:224)
> at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.stop(OmMetadataManagerImpl.java:753)
> at org.apache.hadoop.ozone.om.OzoneManager.stop(OzoneManager.java:2246)
> at org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.stop(MiniOzoneHAClusterImpl.java:304)
> at org.apache.hadoop.ozone.MiniOzoneClusterImpl.shutdown(MiniOzoneClusterImpl.java:446)
> at org.apache.hadoop.ozone.om.TestSnapshotBackgroundServices.shutdown(TestSnapshotBackgroundServices.java:199)
> {code}