[ https://issues.apache.org/jira/browse/MESOS-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669334#comment-16669334 ]
Chun-Hung Hsiao commented on MESOS-9281:
----------------------------------------

Additional patch: https://reviews.apache.org/r/69085/

> SLRP gets a stale checkpoint after system crash.
> ------------------------------------------------
>
>                 Key: MESOS-9281
>                 URL: https://issues.apache.org/jira/browse/MESOS-9281
>             Project: Mesos
>          Issue Type: Bug
>          Components: storage
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Blocker
>              Labels: mesosphere, storage
>
> SLRP checkpoints a pending operation before issuing the corresponding CSI
> call through {{slave::state::checkpoint}}, which writes the new checkpoint to
> a temporary file and then does a {{rename}}. However, because we don't do any
> {{fsync}}, the {{rename}} is not atomic w.r.t. a system crash. As a result, if
> the system crashes while the operation is being processed, it is possible that
> the CSI call has been executed, but the SLRP gets back a stale checkpoint
> after reboot and has no record of the operation.
> To address this problem, we need to ensure the following before issuing the
> CSI call:
> 1. The temp file is synced to the disk.
> 2. The rename is committed to the disk.
> A possible solution is to do an {{fsync}} after writing the temp file, and do
> another {{fsync}} on the checkpoint dir after the {{rename}}.