XComp commented on pull request #16989:
URL: https://github.com/apache/flink/pull/16989#issuecomment-909320524


   I rebased the branch and added another commit that should fix the 
`YARNSessionCapacitySchedulerITCase.testDetachedPerJobYarnCluster` flakiness 
we're experiencing locally. See 
[2f95f6d](https://github.com/apache/flink/pull/16989/commits/2f95f6deb0b880f2bc6da9463d3495c38ee433f0)'s
 commit message for further details on the change and the issue:
   ```
   We observed YARNSessionCapacitySchedulerITCase.testDetachedPerJobYarnCluster
   being flaky on our local machines. The AssertionError was caused by a
   certain log message ("Starting TaskManagers") not being available in the
   job manager logs. The reason for that was that the JobManager startup
   script seems to be triggered more than once in some cases. If that
   happens, two (or more) jobmanager.log files are created with the older
   having ".N" added as a suffix to the name. Because the file name was
   previously matched with contains, we could end up picking the older
   JobManager log file. Those logs don't contain the TaskManager startup
   log message that the assertion requires.
   
   I wasn't able to figure out why we sometimes experience multiple
   JobManager startups. I checked the Hadoop code for the
   DefaultContainerExecutor and the DEBUG logs for YARN. I couldn't find
   any indication of a restart.
   
   But Flink renames older log files and keeps the most recent one as
   jobmanager.log. That's the one we're interested in, anyway. Hence,
   selecting "jobmanager.log" through equals solves the unstable test.
   ```
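
   To illustrate the difference between the two matching strategies, here is a minimal sketch. It is not the actual test code: the class `LogFileSelection`, both helper methods, and the rotated file name `jobmanager.log.1` are hypothetical and only stand in for the ".N" suffix case described above. It also assumes the rotated file happens to be listed first, which is not guaranteed in either direction.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Illustrative only; names are hypothetical and not taken from the Flink code base. */
final class LogFileSelection {

    // Substring match: any file whose name contains the fragment qualifies,
    // so a rotated file such as "jobmanager.log.1" matches as well.
    static Optional<String> pickByContains(List<String> fileNames, String fragment) {
        return fileNames.stream().filter(name -> name.contains(fragment)).findFirst();
    }

    // Exact match: only the file named exactly as expected qualifies.
    static Optional<String> pickByEquals(List<String> fileNames, String expected) {
        return fileNames.stream().filter(name -> name.equals(expected)).findFirst();
    }

    public static void main(String[] args) {
        // Assumed listing order: the rotated (older) file comes first.
        List<String> fileNames = Arrays.asList("jobmanager.log.1", "jobmanager.log");

        // Picks the older, rotated file.
        System.out.println(pickByContains(fileNames, "jobmanager.log").get());
        // Picks the most recent log file.
        System.out.println(pickByEquals(fileNames, "jobmanager.log").get());
    }
}
```

   With the substring match, the result depends on the order in which the files are listed, which would explain why the test only failed some of the time. The exact match is independent of that order.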



