adoroszlai opened a new pull request, #5948:
URL: https://github.com/apache/ozone/pull/5948

   ## What changes were proposed in this pull request?
   
   `TestBackgroundContainerDataScannerIntegration` may fail intermittently.
   
   Example:
   
   ```
   TestBackgroundContainerDataScannerIntegration.testCorruptionDetected(ContainerCorruptions)[1] -- Time elapsed: 13.31 s <<< FAILURE!
   
   Expected: a string containing "MISSING_CHUNKS_DIR"
        but: was "ID=1 | Index=0 | BCSID=4 | State=UNHEALTHY | MISSING_CHUNK_FILE for file /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-7b4415b0-8a9a-489f-9f19-6b15e56f000b/datanode-0/data-0/containers/hdds/7b4415b0-8a9a-489f-9f19-6b15e56f000b/current/containerDir0/1/chunks/111677748019200001.block. Message: /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-7b4415b0-8a9a-489f-9f19-6b15e56f000b/datanode-0/data-0/containers/hdds/7b4415b0-8a9a-489f-9f19-6b15e56f000b/current/containerDir0/1/chunks/111677748019200001.block
   ```
   
   One possible cause is that the scanner may already be checking the new 
container when the corruption is applied.  For example, if the scanner is 
already checking blocks when the chunks directory is deleted, it reports 
`MISSING_CHUNK_FILE` as the reason instead of `MISSING_CHUNKS_DIR`, which it 
only checks at the start of the scan.
   
   ```
   2023-12-06 16:49:35,580 [ContainerDataScanner(/home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-7b4415b0-8a9a-489f-9f19-6b15e56f000b/datanode-0/data-0/containers/hdds)] ERROR ozoneimpl.BackgroundContainerDataScanner (BackgroundContainerDataScanner.java:scanContainer(91)) - Corruption detected in container [1]. Marking it UNHEALTHY.
   java.nio.file.NoSuchFileException: /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-7b4415b0-8a9a-489f-9f19-6b15e56f000b/datanode-0/data-0/containers/hdds/7b4415b0-8a9a-489f-9f19-6b15e56f000b/current/containerDir0/1/chunks/111677748019200001.block
        ...
        at java.nio.channels.FileChannel.open(FileChannel.java:287)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.verifyChecksum(KeyValueContainerCheck.java:373)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.scanBlock(KeyValueContainerCheck.java:350)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.scanData(KeyValueContainerCheck.java:259)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.fullCheck(KeyValueContainerCheck.java:157)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.scanData(KeyValueContainer.java:983)
        at org.apache.hadoop.ozone.container.ozoneimpl.BackgroundContainerDataScanner.scanContainer(BackgroundContainerDataScanner.java:89)
   ```
   
   The problem is fixed by letting the test pause the background scanner before 
creating the container, and resume it only after the corruption is applied.  
This ensures the scanner performs all checks from the start on the corrupt 
container, so it finds the induced problem rather than one that merely follows 
from it (e.g. if the chunks directory is deleted, any specific chunk file will 
be missing, too).
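
The ordering can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `PauseGate`, `scan` and `runScenario` are hypothetical names, and the two-step `scan` only imitates the "directory check first, then per-file checks" order described above. The point is that a scan started while the gate is paused does not look at the container until the corruption is already in place.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the pause/resume ordering; not Ozone's scanner API. */
final class PausableScannerSketch {

  /** Simple gate that a background worker blocks on while paused. */
  static final class PauseGate {
    private boolean paused;

    synchronized void pause() {
      paused = true;
    }

    synchronized void resume() {
      paused = false;
      notifyAll();
    }

    synchronized void awaitUnpaused() throws InterruptedException {
      while (paused) {
        wait();
      }
    }
  }

  /** Directory-level check first, then the per-file check (assumed order). */
  static String scan(Path chunksDir, Path chunkFile) {
    if (!Files.isDirectory(chunksDir)) {
      return "MISSING_CHUNKS_DIR";
    }
    if (!Files.exists(chunkFile)) {
      return "MISSING_CHUNK_FILE";
    }
    return "HEALTHY";
  }

  static String runScenario() throws Exception {
    PauseGate gate = new PauseGate();
    ExecutorService scanner = Executors.newSingleThreadExecutor();
    Path containerDir = Files.createTempDirectory("container-sketch");
    try {
      // 1. Pause the scanner before the container exists.
      gate.pause();

      // 2. Create the "container" with one chunk file.
      Path chunksDir = Files.createDirectory(containerDir.resolve("chunks"));
      Path chunkFile = Files.write(chunksDir.resolve("1.block"), new byte[] {1, 2, 3});

      // The scan is scheduled, but blocks on the gate before its first check.
      Future<String> result = scanner.submit(() -> {
        gate.awaitUnpaused();
        return scan(chunksDir, chunkFile);
      });

      // 3. Apply the corruption: remove the whole chunks directory.
      Files.delete(chunkFile);
      Files.delete(chunksDir);

      // 4. Resume only now, so every check runs against the corrupt state.
      gate.resume();
      return result.get(10, TimeUnit.SECONDS);
    } finally {
      scanner.shutdownNow();
      Files.deleteIfExists(containerDir);
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runScenario()); // prints MISSING_CHUNKS_DIR
  }
}
```

Without the gate, the scan could pass the directory check before step 3 and then fail on the individual file, which is the `MISSING_CHUNK_FILE` outcome seen in the failure above.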
   
   The test may also time out intermittently while waiting for the container to 
become unhealthy:
   
   ```
   TestBackgroundContainerDataScannerIntegration.testCorruptionDetected(ContainerCorruptions)[7] -- Time elapsed: 9.145 s <<< ERROR!
   TimeoutException: 
     ...
     at org.apache.ozone.test.GenericTestUtils.waitFor(GenericTestUtils.java:231)
     at org.apache.hadoop.ozone.dn.scanner.TestBackgroundContainerDataScannerIntegration.testCorruptionDetected(TestBackgroundContainerDataScannerIntegration.java:82)
   ```
   
   The timeout for this wait is increased.
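
To show why the timeout value matters for a polling wait, here is a small sketch of such a helper. It is a stand-in written for illustration, with an assumed signature; it is not the implementation of `GenericTestUtils.waitFor`, and the interval and timeout numbers are arbitrary.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; the signature is an assumption for this sketch. */
final class WaitForSketch {

  /** Polls {@code check} every {@code intervalMillis} until it holds or the timeout passes. */
  static void waitFor(BooleanSupplier check, long intervalMillis, long timeoutMillis)
      throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("condition not met within " + timeoutMillis + " ms");
      }
      Thread.sleep(intervalMillis);
    }
  }

  /** Condition that becomes true on the fifth poll, imitating a slow state change. */
  static BooleanSupplier trueOnFifthPoll() {
    AtomicInteger polls = new AtomicInteger();
    return () -> polls.incrementAndGet() >= 5;
  }

  /** Returns true if the short wait timed out and the longer wait succeeded. */
  static boolean demo() throws InterruptedException {
    boolean shortWaitTimedOut = false;
    try {
      // Four 20 ms sleeps are needed before the fifth poll, so 30 ms is too short.
      waitFor(trueOnFifthPoll(), 20, 30);
    } catch (TimeoutException e) {
      shortWaitTimedOut = true;
    }

    boolean longWaitSucceeded = true;
    try {
      // Same condition, more generous timeout.
      waitFor(trueOnFifthPoll(), 20, 5_000);
    } catch (TimeoutException e) {
      longWaitSucceeded = false;
    }
    return shortWaitTimedOut && longWaitSucceeded;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(demo());
  }
}
```

The condition is identical in both waits; only the time budget differs, which is the same trade-off as in the test when the scanner needs longer than usual to mark the container unhealthy.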
   
   Further improvements in `ContainerCorruptions`:
   
    * verify that corruptions were effective (contents changed / file deleted)
    * set `SYNC` when writing the files
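
A rough sketch of what these two improvements could look like is below. The helper names (`corruptFile`, `deleteFile`) and the bit-flip corruption are hypothetical and are not taken from `ContainerCorruptions`; only `StandardOpenOption.SYNC` and the `java.nio.file.Files` calls are standard JDK API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical helpers illustrating "verify the corruption" and "write with SYNC". */
final class VerifiedCorruptionSketch {

  /** Flips the bits of the first byte, writes with SYNC, then verifies the contents changed. */
  static void corruptFile(Path file) throws IOException {
    byte[] original = Files.readAllBytes(file);
    if (original.length == 0) {
      throw new IllegalStateException("Nothing to corrupt in " + file);
    }
    byte[] corrupted = original.clone();
    corrupted[0] ^= 0xFF;

    // SYNC asks for content and metadata to be written synchronously to storage.
    Files.write(file, corrupted,
        StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.SYNC);

    if (Arrays.equals(original, Files.readAllBytes(file))) {
      throw new IllegalStateException("Corruption had no effect on " + file);
    }
  }

  /** Deletes the file and verifies that it is really gone. */
  static void deleteFile(Path file) throws IOException {
    Files.delete(file);
    if (Files.exists(file)) {
      throw new IllegalStateException("File still exists: " + file);
    }
  }

  /** Exercises both helpers on a temporary file; returns true if both took effect. */
  static boolean demo() throws IOException {
    Path dir = Files.createTempDirectory("corruption-sketch");
    Path block = dir.resolve("1.block");
    Files.write(block, new byte[] {10, 20, 30});

    corruptFile(block);
    boolean changed = Files.readAllBytes(block)[0] != 10;

    deleteFile(block);
    boolean gone = Files.notExists(block);

    Files.delete(dir);
    return changed && gone;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(demo());
  }
}
```

Failing fast inside the corruption helper separates "the corruption was never applied" from "the scanner did not detect it", which are otherwise hard to tell apart in an intermittent failure.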
   
   https://issues.apache.org/jira/browse/HDDS-9852
   
   ## How was this patch tested?
   
   Test passed in 2x200 runs:
   https://github.com/adoroszlai/ozone/actions/runs/7449863110
   https://github.com/adoroszlai/ozone/actions/runs/7450532753
   
   while it previously failed in 3/200 runs:
   https://github.com/adoroszlai/ozone/actions/runs/7449847722/job/20269118137#step:3:19
   
   Regular CI:
   https://github.com/adoroszlai/ozone/actions/runs/7450506376


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]