[ https://issues.apache.org/jira/browse/HADOOP-6064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720536#action_12720536 ]
Hemanth Yamijala commented on HADOOP-6064:
------------------------------------------

Just for information, the failure this time around happened as follows:
- The test timed out in multipleQsWithOneQBeyondCapacity, while waiting for 5 map tasks to complete.
- The check for completion of tasks assumes all map tasks run successfully in ControlledMapReduceJob. Note that the check is on jip.finishedMaps(), which does not count failed tasks.
- However, one of the map tasks failed this time, with the following stack trace:
{noformat}
    [junit] 09/06/17 12:49:20 INFO mapred.TaskInProgress: Error from attempt_200906171248_0001_m_000003_0: java.io.FileNotFoundException: File signalFileDir-7646601804912829477/MAPS_0 does not exist.
    [junit]     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383)
    [junit]     at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:301)
    [junit]     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:746)
    [junit]     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:771)
    [junit]     at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:465)
    [junit]     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:746)
    [junit]     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:806)
    [junit]     at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:936)
    [junit]     at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:891)
    [junit]     at org.apache.hadoop.mapred.ControlledMapReduceJob.listSignalFiles(ControlledMapReduceJob.java:278)
    [junit]     at org.apache.hadoop.mapred.ControlledMapReduceJob.map(ControlledMapReduceJob.java:318)
    [junit]     at org.apache.hadoop.mapred.ControlledMapReduceJob.map(ControlledMapReduceJob.java:60)
    [junit]     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    [junit]     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
    [junit]     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
    [junit]     at org.apache.hadoop.mapred.Child.main(Child.java:159)
{noformat}
- This, in turn, seems to relate to the problem described in HADOOP-4167. The mappers all list the contents of a filesystem directory looking for 'signal' files. These signal files are renamed asynchronously, so a file can go missing while a mapper is in the middle of listing.
- Because the failed map task is never counted as finished, the test keeps waiting for the fifth map and thus times out.

> Rewrite TestQueueCapacities to make it simpler and avoid timeout errors
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-6064
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6064
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched, test
>    Affects Versions: 0.20.0
>            Reporter: Hemanth Yamijala
>
> We have seen TestQueueCapacities fail periodically, and a couple of times
> fixes have only partially resolved the problem, the most recent instance
> being HADOOP-5869. I found another instance of failure while running tests
> locally to test a different patch. This was a different symptom from
> the ones we've seen before. The core problem is that the test is too complex
> and relies on too many things working correctly to be useful. It would make
> sense to revisit the purpose of the test and see if a simpler model can serve
> it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
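
For illustration only, here is a rough sketch of how a completion check like the one described in the comment could avoid hanging when a map task fails. It is not the test's actual code and not a proposed patch. The MapCounts interface, its failedMaps() method, and the waitForMaps helper are all hypothetical names introduced for this example; only the idea of polling finishedMaps() comes from the comment above.

```java
import java.util.concurrent.TimeoutException;

public class BoundedMapWait {

    /** Hypothetical view of a job's map counters; not a Hadoop interface. */
    public interface MapCounts {
        int finishedMaps(); // successfully completed maps only
        int failedMaps();   // failed maps (assumed to be observable)
    }

    /**
     * Polls until `expected` maps have finished. Fails fast if any map
     * has failed, and gives up after timeoutMillis instead of waiting forever.
     */
    public static void waitForMaps(MapCounts job, int expected,
                                   long timeoutMillis, long pollMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (job.finishedMaps() < expected) {
            if (job.failedMaps() > 0) {
                // A failed map never shows up in finishedMaps(),
                // so continuing to wait cannot succeed.
                throw new IllegalStateException(
                    job.failedMaps() + " map task(s) failed while waiting for "
                    + expected + " to finish");
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new TimeoutException(
                    "only " + job.finishedMaps() + " of " + expected
                    + " maps finished within " + timeoutMillis + " ms");
            }
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulates the reported situation: 4 maps finished, 1 failed.
        MapCounts stuck = new MapCounts() {
            public int finishedMaps() { return 4; }
            public int failedMaps() { return 1; }
        };
        try {
            waitForMaps(stuck, 5, 1000, 10);
        } catch (IllegalStateException e) {
            System.out.println("Detected failure: " + e.getMessage());
        }
    }
}
```

With a check of this shape, the situation described above would surface as an immediate, descriptive failure rather than a test timeout. Whether the real test can observe failed tasks this directly is an assumption.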