[ 
https://issues.apache.org/jira/browse/HDDS-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806913#comment-17806913
 ] 

Christos Bisias edited comment on HDDS-10059 at 1/15/24 4:54 PM:
-----------------------------------------------------------------

This test:
 * Begins with an inactive follower
 * Writes snapshots, each with new keys
 * Starts the inactive follower
 ** The follower installs a Ratis snapshot that contains all the snapshots and their keys from the leader
 * Validates the follower’s data from the latest snapshot

During the data validation, the test checks that every snapshot file that is a 
hard-link on the leader is also a hard-link on the follower.
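
As a rough illustration of what "is a hard-link" means for two paths, here is a minimal sketch. It is not the Ozone implementation: the stack trace in Failure 2 only shows that OmSnapshotUtils.getINode goes through Files.readAttributes. The class and method names below are hypothetical, and the sketch assumes a POSIX file system where two hard-linked paths report the same file key.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of Ozone.
final class HardLinkCheck {
  private HardLinkCheck() { }

  /**
   * Returns true when both paths refer to the same underlying file.
   * Throws NoSuchFileException if either path does not exist.
   */
  static boolean isHardLink(Path a, Path b) throws IOException {
    Object keyA = Files.readAttributes(a, BasicFileAttributes.class).fileKey();
    Object keyB = Files.readAttributes(b, BasicFileAttributes.class).fileKey();
    if (keyA == null || keyB == null) {
      // fileKey() may be null on platforms that don't expose one.
      return Files.isSameFile(a, b);
    }
    return keyA.equals(keyB);
  }
}
```

Note that a file which exists in both directories under the same name but was copied rather than linked makes this return false, and a file missing from one side makes it throw.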

When the test is run repeatedly in a remote workflow, it fails almost 50% of 
the time. As a result, the test has been marked as unhealthy and removed from 
CI. In the failing runs, one or more files that exist on the leader's active fs 
and are hard-links don’t exist under the follower’s active fs.

*The test assumes that if a file exists in both directories, active fs and 
snapshot, then there is a hard link. The follower gets only a fraction of the 
sst files that exist in the leader’s active fs dir, and that could be the issue.*

To elaborate, this is how the follower gets the files found in the active fs dir:
 # The leader takes a RocksDB checkpoint
 # The files from the checkpoint are stored under a new dir, *leader_node/db.checkpoints/om.db_checkpoint_…*
 # The checkpoint files are stored at the root of the tarball and end up under the active fs dir on the follower
 # A hardLinkFile is created that contains all the hard links that the follower must create between the files in the active fs and the snapshot directories
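
The last step can be sketched roughly as follows. This is only an illustration of the idea: the real hardLinkFile format is not shown in this comment, so the sketch assumes one entry per line in the form "link path, tab, target path", both relative to the db dir. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the Ozone implementation.
final class HardLinkFileSketch {
  private HardLinkFileSketch() { }

  /**
   * Creates one hard link per entry and returns how many were created.
   * Assumed entry format: "<link>\t<target>", relative to dbDir.
   */
  static int createLinks(Path dbDir, Path hardLinkFile) throws IOException {
    int created = 0;
    for (String line : Files.readAllLines(hardLinkFile)) {
      if (line.trim().isEmpty()) {
        continue; // skip blank lines
      }
      String[] parts = line.split("\t");
      if (parts.length != 2) {
        throw new IOException("Malformed entry: " + line);
      }
      Path link = dbDir.resolve(parts[0]);
      Path target = dbDir.resolve(parts[1]);
      if (link.getParent() != null) {
        Files.createDirectories(link.getParent());
      }
      // Fails if the target was never unpacked into dbDir.
      Files.createLink(link, target);
      created++;
    }
    return created;
  }
}
```

Under this assumption, the follower can only link files that are listed in the hardLinkFile and were actually shipped in the tarball.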

I have gone through the entire process, from the leader taking the checkpoint 
up to the follower installing the tarball and creating the hard links. The 
missing files aren’t part of the RocksDB checkpoint. Therefore, we can’t find 
them in the checkpoint dir, the tarball root, or the hardLinkFile. The 
hardLinkFile contains references to the snapshots that contain these sst files, 
but not to the active fs root.

The RocksDB checkpoint contains only a small part of the files that exist in the 
leader’s active fs dir. This is an example of the number of files found in each 
dir:

*leader.activeFs number of files: 264*

*leader.snapshot number of files: 52*

*follower.activeFs number of files: 59*

*follower.snapshot number of files: 58*

The test does the following:
 # Gets all the files in the leader's snapshot dir
 # For every file:
 ## Gets the filename
 ## Checks whether a file with that name exists under the leader's active fs
 ## If it does, checks whether the file on the active fs and the file on the snapshot dir, on the leader, are hard-links
 ## If they are hard-links on the leader, asserts that they are hard-links on the follower as well

As mentioned above, not all sst files found on the leader's active fs are 
included in a checkpoint. Because we are iterating over the leader's sst files, 
there is a very high chance that some of these files won't be present on the 
follower.

The above steps can be changed to:
 # Get all the files from the follower's snapshot dir
 # For every file:
 ## Get the filename
 ## Check whether it exists on the follower's active fs AND on the leader's active fs
 ## If so, check whether the snapshot and active fs files are hard-links on the leader
 ## If they are hard-links on the leader, assert that they are hard-links on the follower as well
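
The revised steps could look roughly like this. It is a sketch and not the actual change to TestOMRatisSnapshots: the class name, method name and the four directory parameters are hypothetical, Files.isSameFile stands in for the inode comparison (assuming a POSIX file system), and the extra existence check on the leader's snapshot file is an added guard that is not among the steps above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the revised validation.
final class FollowerHardLinkValidation {
  private FollowerHardLinkValidation() { }

  /**
   * Returns the names of files that are hard-linked on the leader
   * but not hard-linked on the follower.
   */
  static List<String> findBrokenLinks(Path leaderActive, Path leaderSnapshot,
      Path followerActive, Path followerSnapshot) throws IOException {
    List<String> broken = new ArrayList<>();
    // Iterate over the follower's snapshot dir, not the leader's.
    try (DirectoryStream<Path> files = Files.newDirectoryStream(followerSnapshot)) {
      for (Path followerSnapFile : files) {
        String name = followerSnapFile.getFileName().toString();
        Path followerActiveFile = followerActive.resolve(name);
        Path leaderActiveFile = leaderActive.resolve(name);
        Path leaderSnapFile = leaderSnapshot.resolve(name);
        // Only compare files present on both active file systems.
        if (!Files.exists(followerActiveFile)
            || !Files.exists(leaderActiveFile)
            || !Files.exists(leaderSnapFile)) {
          continue;
        }
        if (Files.isSameFile(leaderActiveFile, leaderSnapFile)
            && !Files.isSameFile(followerActiveFile, followerSnapFile)) {
          broken.add(name);
        }
      }
    }
    return broken;
  }
}
```

A test built on this would assert that the returned list is empty. Files that never reached the follower's active fs are skipped instead of triggering a NoSuchFileException.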

With the above approach, the test passes in a 10x10 run: 
[https://github.com/xBis7/ozone/actions/runs/7507693387]

[~hemantk] Does this seem like a reasonable change? Or do you think there is a 
bug here that needs further investigation?

The previous approach checked that every hard-link on the leader is also 
present on the follower. The new approach checks only the files that can be 
found on the follower: if such a file is a hard-link on the leader, the test 
asserts that it is a hard-link on the follower as well.


> [disabled] Intermittent failure in TestOMRatisSnapshots.testInstallSnapshot
> ---------------------------------------------------------------------------
>
>                 Key: HDDS-10059
>                 URL: https://issues.apache.org/jira/browse/HDDS-10059
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: test
>            Reporter: Attila Doroszlai
>            Assignee: Christos Bisias
>            Priority: Major
>
> Failure 1:
> {code:title=https://github.com/adoroszlai/ozone-build-results/blob/master/2023/12/30/27977/it-om/hadoop-ozone/integration-test/org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt}
> org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(int, Path)[1] -- Time elapsed: 90.79 s <<< ERROR!
> java.io.IOException: snapshot directory doesn't exist
>       at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.createOzoneSnapshot(TestOMRatisSnapshots.java:1060)
>       at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(TestOMRatisSnapshots.java:238)
> {code}
> Failure 2:
> {code:title=https://github.com/adoroszlai/ozone-build-results/blob/master/2024/01/03/28076/it-om/hadoop-ozone/integration-test/org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt}
> org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(int, Path)[1] -- Time elapsed: 85.34 s <<< ERROR!
> java.nio.file.NoSuchFileException: /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-3a4cbda0-c8c0-415c-b2a8-04058ca404e1/omNode-3/om.db/000786.sst
>       at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>       at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>       at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>       at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
>       at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
>       at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
>       at java.nio.file.Files.readAttributes(Files.java:1737)
>       at org.apache.hadoop.ozone.om.snapshot.OmSnapshotUtils.getINode(OmSnapshotUtils.java:67)
>       at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.checkSnapshot(TestOMRatisSnapshots.java:373)
>       at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(TestOMRatisSnapshots.java:312)
> {code}
> [~xBis] would you like to take a look?


