[
https://issues.apache.org/jira/browse/HDDS-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806913#comment-17806913
]
Christos Bisias edited comment on HDDS-10059 at 1/15/24 4:54 PM:
-----------------------------------------------------------------
This test
* Begins with an inactive follower
* Writes snapshots, each with new keys
* Starts the inactive follower
** The new follower installs a Ratis Snapshot that has all the snapshots and
their keys from the leader
* Validates the follower’s data from the latest snapshot
During data validation, the test checks that every snapshot file that is a
hard-link on the leader is also a hard-link on the follower.
When the test is run repeatedly on a remote workflow, it fails almost 50% of
the time. As a result, the test has been marked as unhealthy and removed from
the CI. In these failures, one or more files that exist on the leader's active
fs as hard-links don't exist under the follower's active fs.
*The test assumes that if a file exists in both directories, active fs and
snapshot, then it is a hard link. The follower gets only a fraction of the sst
files that exist in the leader's active fs dir, and that could be the issue.*
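To illustrate why a matching filename alone doesn't prove a hard link, here is a minimal sketch that compares the file keys (inodes on POSIX file systems) of two paths. The class and method names are hypothetical and this is not the Ozone test code; it only assumes a file system that supports hard links and reports a non-null {{fileKey()}}.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, for illustration only.
class HardLinkCheck {

  // True only when both paths exist and share the same file key.
  // Assumes a POSIX-like file system where fileKey() is non-null.
  static boolean isHardLinked(Path a, Path b) throws IOException {
    if (!Files.exists(a) || !Files.exists(b)) {
      return false;
    }
    Object keyA = Files.readAttributes(a, BasicFileAttributes.class).fileKey();
    Object keyB = Files.readAttributes(b, BasicFileAttributes.class).fileKey();
    return keyA != null && keyA.equals(keyB);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("hard-link-check");
    Path original = Files.write(dir.resolve("000001.sst"), new byte[] {1, 2, 3});
    Path link = Files.createLink(dir.resolve("linked.sst"), original);
    Path copy = Files.copy(original, dir.resolve("copied.sst"));

    // A hard link shares the file key; a copy with identical content does not.
    System.out.println("link: " + isHardLinked(original, link));
    System.out.println("copy: " + isHardLinked(original, copy));
  }
}
```

A copy with the same name and content in another directory would pass an "exists on both sides" check but fail this comparison, which is the distinction the test relies on.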
To elaborate, this is how the follower gets the files found in its active fs
dir
# The leader gets a RocksDB checkpoint
# The files from the checkpoint get stored under a new dir
*leader_node/db.checkpoints/om.db_checkpoint_…*
# The checkpoint files are stored at the root of the tarball and end up
under the active fs dir on the follower
# A hardLinkFile is created that lists all the hard links that the follower
must create between the files in the active fs and the snapshot directories
I have gone through the entire process, from the leader getting the checkpoint
up to the follower installing the tarball and creating the hard links. The
missing files aren't part of the RocksDB checkpoint. Therefore, we can't find
them in the checkpoint dir, the tarball root, or the hardLinkFile. The
hardLinkFile contains references to the snapshots that contain these sst files,
but not to the active fs root.
The RocksDB checkpoint contains only a small part of the files that exist in
the leader's active fs dir. This is an example of the number of files found in
each dir
*leader.activeFs number of files: 264*
*leader.snapshot number of files: 52*
*follower.activeFs number of files: 59*
*follower.snapshot number of files: 58*
The test does the following
# Gets all the files in the leader's snapshot dir
# For every file
## Gets the filename
## Checks whether it exists under the leader's active fs
## If it does, checks whether the file on the active fs and the file on the
snapshot dir, on the leader, are hard-links
## If they are hard-links on the leader, asserts that they are hard-links on
the follower as well
As mentioned above, not all sst files found on the leader's active fs are
included in a checkpoint. Because the test iterates over the leader's sst
files, there is a very high chance that some of these files won't be present
on the follower.
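As a rough sketch of the failure mode, the leader-driven loop can be modelled like this. All names here are hypothetical and the directory layout is simplified to four flat dirs; it is not the actual checkSnapshot code. It assumes a POSIX-like file system. When a file is hard-linked on the leader but was never shipped to the follower's active fs, reading its attributes throws NoSuchFileException, similar to Failure 2 in the issue description.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical model of the leader-driven check, for illustration only.
class LeaderDrivenCheck {

  // Throws NoSuchFileException when the path does not exist.
  static Object inode(Path p) throws IOException {
    return Files.readAttributes(p, BasicFileAttributes.class).fileKey();
  }

  static void check(Path leaderActive, Path leaderSnapshot,
      Path followerActive, Path followerSnapshot) throws IOException {
    List<Path> leaderSnapshotFiles;
    try (Stream<Path> stream = Files.list(leaderSnapshot)) {
      leaderSnapshotFiles = stream.sorted().collect(Collectors.toList());
    }
    for (Path leaderSnapshotFile : leaderSnapshotFiles) {
      String name = leaderSnapshotFile.getFileName().toString();
      Path leaderActiveFile = leaderActive.resolve(name);
      if (!Files.exists(leaderActiveFile)) {
        continue;
      }
      if (Objects.equals(inode(leaderSnapshotFile), inode(leaderActiveFile))) {
        // No existence check on the follower side: this is where a file
        // that was not part of the checkpoint makes the check blow up.
        Object followerActiveInode = inode(followerActive.resolve(name));
        Object followerSnapshotInode = inode(followerSnapshot.resolve(name));
        if (!Objects.equals(followerActiveInode, followerSnapshotInode)) {
          throw new AssertionError(name + " is not a hard link on the follower");
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("leader-driven");
    Path leaderActive = Files.createDirectory(root.resolve("leaderActive"));
    Path leaderSnapshot = Files.createDirectory(root.resolve("leaderSnapshot"));
    Path followerActive = Files.createDirectory(root.resolve("followerActive"));
    Path followerSnapshot = Files.createDirectory(root.resolve("followerSnapshot"));

    Files.write(leaderActive.resolve("000786.sst"), new byte[] {1});
    Files.createLink(leaderSnapshot.resolve("000786.sst"), leaderActive.resolve("000786.sst"));
    // The follower has the snapshot file but not the active fs file.
    Files.write(followerSnapshot.resolve("000786.sst"), new byte[] {1});

    try {
      check(leaderActive, leaderSnapshot, followerActive, followerSnapshot);
    } catch (NoSuchFileException e) {
      System.out.println("Missing on follower: " + e.getFile());
    }
  }
}
```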
The above steps can be changed to
# Get all the files from the follower's snapshot dir
# For every file
## Get the filename
## Check whether it exists on the follower's active fs AND on the leader's
active fs
## If it does, check whether the snapshot and active fs files are hard-links
on the leader
## If they are hard-links on the leader, assert that they are hard-links on
the follower as well
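The revised steps could look roughly like the sketch below. The class, method, and parameter names are hypothetical, and the layout is simplified to four flat directories rather than the real OM directory structure. It uses Files.isSameFile to compare files and adds one guard not listed in the steps (the leader's snapshot file must exist), which is an assumption on my side.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical model of the follower-driven check, for illustration only.
class FollowerDrivenCheck {

  // Returns the names of files that are hard-linked on the leader
  // but not hard-linked on the follower.
  static List<String> findBrokenLinks(Path leaderActive, Path leaderSnapshot,
      Path followerActive, Path followerSnapshot) throws IOException {
    List<Path> followerSnapshotFiles;
    try (Stream<Path> stream = Files.list(followerSnapshot)) {
      followerSnapshotFiles = stream.sorted().collect(Collectors.toList());
    }
    List<String> broken = new ArrayList<>();
    for (Path followerSnapshotFile : followerSnapshotFiles) {
      String name = followerSnapshotFile.getFileName().toString();
      Path followerActiveFile = followerActive.resolve(name);
      Path leaderActiveFile = leaderActive.resolve(name);
      Path leaderSnapshotFile = leaderSnapshot.resolve(name);
      // Only consider files present on the follower's and the leader's active fs.
      // The leaderSnapshotFile guard is an extra assumption, not one of the steps.
      if (!Files.exists(followerActiveFile) || !Files.exists(leaderActiveFile)
          || !Files.exists(leaderSnapshotFile)) {
        continue;
      }
      boolean linkedOnLeader = Files.isSameFile(leaderSnapshotFile, leaderActiveFile);
      boolean linkedOnFollower = Files.isSameFile(followerSnapshotFile, followerActiveFile);
      if (linkedOnLeader && !linkedOnFollower) {
        broken.add(name);
      }
    }
    return broken;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("follower-driven");
    Path leaderActive = Files.createDirectory(root.resolve("leaderActive"));
    Path leaderSnapshot = Files.createDirectory(root.resolve("leaderSnapshot"));
    Path followerActive = Files.createDirectory(root.resolve("followerActive"));
    Path followerSnapshot = Files.createDirectory(root.resolve("followerSnapshot"));

    Files.write(leaderActive.resolve("a.sst"), new byte[] {1});
    Files.createLink(leaderSnapshot.resolve("a.sst"), leaderActive.resolve("a.sst"));
    Files.write(followerActive.resolve("a.sst"), new byte[] {1});
    Files.createLink(followerSnapshot.resolve("a.sst"), followerActive.resolve("a.sst"));
    // Present only in the follower's snapshot dir: skipped instead of failing.
    Files.write(followerSnapshot.resolve("b.sst"), new byte[] {2});

    System.out.println("Broken links: "
        + findBrokenLinks(leaderActive, leaderSnapshot, followerActive, followerSnapshot));
  }
}
```

Because the loop starts from the follower's snapshot dir and skips anything missing from either active fs, a file that never made it into the checkpoint can no longer trigger a NoSuchFileException.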
With the above approach, the test passes 10x10:
[https://github.com/xBis7/ozone/actions/runs/7507693387]
[~hemantk] Does this seem like a reasonable change? Or do you think there is a
bug here that needs further investigation?
The previous approach checked that all hard-links on the leader are present on
the follower. The new approach considers only the files that can be found on
the follower: if such a file is a hard-link on the leader, the test checks that
it is a hard-link on the follower as well.
> [disabled] Intermittent failure in TestOMRatisSnapshots.testInstallSnapshot
> ---------------------------------------------------------------------------
>
> Key: HDDS-10059
> URL: https://issues.apache.org/jira/browse/HDDS-10059
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: test
> Reporter: Attila Doroszlai
> Assignee: Christos Bisias
> Priority: Major
>
> Failure 1:
> {code:title=https://github.com/adoroszlai/ozone-build-results/blob/master/2023/12/30/27977/it-om/hadoop-ozone/integration-test/org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt}
> org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(int, Path)[1] -- Time elapsed: 90.79 s <<< ERROR!
> java.io.IOException: snapshot directory doesn't exist
> at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.createOzoneSnapshot(TestOMRatisSnapshots.java:1060)
> at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(TestOMRatisSnapshots.java:238)
> {code}
> Failure 2:
> {code:title=https://github.com/adoroszlai/ozone-build-results/blob/master/2024/01/03/28076/it-om/hadoop-ozone/integration-test/org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt}
> org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(int, Path)[1] -- Time elapsed: 85.34 s <<< ERROR!
> java.nio.file.NoSuchFileException: /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-3a4cbda0-c8c0-415c-b2a8-04058ca404e1/omNode-3/om.db/000786.sst
> at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
> at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> at java.nio.file.Files.readAttributes(Files.java:1737)
> at org.apache.hadoop.ozone.om.snapshot.OmSnapshotUtils.getINode(OmSnapshotUtils.java:67)
> at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.checkSnapshot(TestOMRatisSnapshots.java:373)
> at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(TestOMRatisSnapshots.java:312)
> {code}
> [~xBis] would you like to take a look?