adoroszlai commented on PR #8093: URL: https://github.com/apache/ozone/pull/8093#issuecomment-2727210683
> this seems a bug in the datanode code (or in the test) -- it should never have more than one threads writing to the VERSION file at the same time. So, we should fix them instead.

There is no test involved; it can be reproduced simply by starting a new datanode in an HA cluster.

```
cd hadoop-ozone/dist/target/ozone-2.0.0-SNAPSHOT/compose/ozone-h
docker compose up -d --scale datanode=3
docker compose exec s3g ozone admin safemode wait -t 60
docker compose up -d --no-recreate --scale datanode=4
docker logs ozone-ha-datanode-4 2>&1 | grep 'Failed to Atomic Files.move'
```

I think there are two issues:

1. Given that multiple threads may write file `X`, using the same temp file `X.tmp` for `AtomicFileOutputStream` introduces a race condition. This patch fixes that. The fix is in utils code, so it also benefits other possible users of these methods. (It should also be fixed in Ratis, but currently we cannot upgrade (HDDS-12103) and do not want to wait for a new release anyway.)
2. The datanode writes the VERSION file from multiple threads. The file includes clusterID, which the datanode gets from SCM, so these "get version" tasks seem to be the right place to do it. They should probably check if the file exists, and use a shared lock while checking/writing. They may also need to check if clusterID from each SCM is the same.

This PR is limited to (1); we can fix (2) separately.
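To illustrate the idea behind (1), here is a minimal sketch that uses plain `java.nio` rather than the actual `AtomicFileOutputStream` code in Ozone or Ratis. The class `UniqueTempFileWriter` and the method `writeAtomically` are hypothetical names chosen for this example. Each writer gets its own temp file, so concurrent writers of the same target never touch each other's temp file before the move.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not the Ozone or Ratis implementation. */
final class UniqueTempFileWriter {
  private UniqueTempFileWriter() { }

  /** Writes content to a per-call temp file, then moves it over the target. */
  static void writeAtomically(Path target, String content) throws IOException {
    Path dir = target.toAbsolutePath().getParent();
    // createTempFile creates a new file with a name that did not exist before,
    // so two concurrent callers never share one "X.tmp".
    Path tmp = Files.createTempFile(dir, target.getFileName() + ".", ".tmp");
    try {
      Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
      // With ATOMIC_MOVE, whether an existing target is replaced is
      // implementation specific; on POSIX file systems it is replaced.
      Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    } finally {
      // No-op after a successful move; cleans up the temp file on failure.
      Files.deleteIfExists(tmp);
    }
  }
}
```

This only removes the temp-file collision. If several threads still write the same target, the last move wins, which is why (2) remains a separate problem.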
