XComp commented on pull request #16989: URL: https://github.com/apache/flink/pull/16989#issuecomment-909320524
I rebased the branch and added another commit that should fix the `YARNSessionCapacitySchedulerITCase.testDetachedPerJobYarnCluster` flakiness we're experiencing locally. See the commit message of [2f95f6d](https://github.com/apache/flink/pull/16989/commits/2f95f6deb0b880f2bc6da9463d3495c38ee433f0) for further details on the change and the issue:

```
We observed YARNSessionCapacitySchedulerITCase.testDetachedPerJobYarnCluster
being flaky on our local machines. The AssertionError was caused by a certain
log message ("Starting TaskManagers") not being available in the job manager
logs.

The reason for that was that the JobManager startup script seems to be
triggered more than once in some cases. If that happens, two (or more)
jobmanager.log files are created, with the older ones having ".N" added as a
suffix to the name. Due to the previously used contains method, we ended up
picking the older JobManager log file. These logs wouldn't contain the
TaskManager startup log message which is required by the assertion.

I wasn't able to figure out why we sometimes experience multiple JobManager
startups. I checked the Hadoop code for the DefaultContainerExecutor and the
DEBUG logs for YARN. I couldn't find any indication of a restart.

But Flink renames older log files and keeps the most recent one as
jobmanager.log. That's the one we're interested in, anyway. Hence, selecting
"jobmanager.log" through equals solves the unstable test.
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
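
To illustrate why matching with `contains` can pick the wrong file while `equals` does not, here is a minimal, self-contained sketch. It is not the actual test code from the PR: the class and method names (`JobManagerLogSelection`, `selectByContains`, `selectByEquals`) and the file listing are hypothetical, and `jobmanager.log.1` merely stands in for an older log file that was renamed with a `.N` suffix. The listing order is also an assumption; a real directory listing gives no such ordering guarantee, which is consistent with the test failing only sometimes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class JobManagerLogSelection {

    // Substring match: any name that contains the expected name qualifies,
    // so a renamed older file such as "jobmanager.log.1" can be picked first.
    public static Optional<String> selectByContains(List<String> fileNames, String expected) {
        return fileNames.stream().filter(name -> name.contains(expected)).findFirst();
    }

    // Exact match: only the file named exactly "jobmanager.log" qualifies.
    public static Optional<String> selectByEquals(List<String> fileNames, String expected) {
        return fileNames.stream().filter(name -> name.equals(expected)).findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical listing in which the older, renamed log comes first.
        List<String> files =
                Arrays.asList("jobmanager.log.1", "jobmanager.log", "jobmanager.err");

        System.out.println(selectByContains(files, "jobmanager.log").get()); // jobmanager.log.1
        System.out.println(selectByEquals(files, "jobmanager.log").get());   // jobmanager.log
    }
}
```

With the exact match, the result no longer depends on the order in which the files happen to be listed.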