[jira] Commented: (HADOOP-6064) Rewrite TestQueueCapacities to make it simpler and avoid timeout errors

Hemanth Yamijala (JIRA) Wed, 17 Jun 2009 00:52:35 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-6064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720536#action_12720536 ]


Hemanth Yamijala commented on HADOOP-6064:
------------------------------------------

Just for information, the failure this time around happened as follows:

- The test timed out in multipleQsWithOneQBeyondCapacity, while waiting for 5 
map tasks to complete.
- The check for completion of tasks assumes that all map tasks in 
ControlledMapReduceJob run successfully. Note that the check is on 
jip.finishedMaps(), which does not count failed tasks.
- However, one of the map tasks failed this time, with the following stack 
trace:
{noformat}
    [junit] 09/06/17 12:49:20 INFO mapred.TaskInProgress: Error from attempt_200906171248_0001_m_000003_0: java.io.FileNotFoundException: File signalFileDir-7646601804912829477/MAPS_0 does not exist.
    [junit]   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383)
    [junit]   at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:301)
    [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:746)
    [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:771)
    [junit]   at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:465)
    [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:746)
    [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:806)
    [junit]   at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:936)
    [junit]   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:891)
    [junit]   at org.apache.hadoop.mapred.ControlledMapReduceJob.listSignalFiles(ControlledMapReduceJob.java:278)
    [junit]   at org.apache.hadoop.mapred.ControlledMapReduceJob.map(ControlledMapReduceJob.java:318)
    [junit]   at org.apache.hadoop.mapred.ControlledMapReduceJob.map(ControlledMapReduceJob.java:60)
    [junit]   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    [junit]   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
    [junit]   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
    [junit]   at org.apache.hadoop.mapred.Child.main(Child.java:159)
{noformat}
- This, in turn, seems to be related to the problem described in HADOOP-4167. 
The mappers all list the contents of the filesystem looking for 'signal' 
files. These signal files are renamed asynchronously, so a file can go 
missing while a mapper is still listing the directory.
- Because the failed task is never counted as finished, the test waits 
forever and thus times out.
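
To illustrate the last point, here is a minimal sketch of a completion check 
that gives up on a deadline and also stops as soon as a task failure is 
seen, instead of polling the finished count forever. It is not the code in 
TestQueueCapacities. MapCompletionWaiter, waitForMaps, Outcome and the two 
IntSupplier arguments are hypothetical names; the suppliers stand in for 
whatever counters the test can read (for example the value returned by 
jip.finishedMaps(), plus some count of failed tasks, which is an assumption 
here).

```java
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration; not part of Hadoop or the test. */
final class MapCompletionWaiter {

    /** Possible results of the bounded wait. */
    enum Outcome { COMPLETED, TASK_FAILED, TIMED_OUT, INTERRUPTED }

    /**
     * Polls the two counters until the expected number of maps has finished,
     * a failure is observed, or the timeout expires.
     *
     * @param finishedMaps  supplies the number of successfully finished maps
     * @param failedMaps    supplies the number of failed maps (assumed counter)
     * @param expected      number of finished maps the caller is waiting for
     * @param timeoutMillis upper bound on the total wait
     * @param pollMillis    sleep between two polls
     */
    static Outcome waitForMaps(IntSupplier finishedMaps, IntSupplier failedMaps,
                               int expected, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            // A failed task is not included in the finished count, so it has
            // to be checked separately or the loop would never terminate.
            if (failedMaps.getAsInt() > 0) {
                return Outcome.TASK_FAILED;
            }
            if (finishedMaps.getAsInt() >= expected) {
                return Outcome.COMPLETED;
            }
            // Overflow-safe comparison against the deadline.
            if (System.nanoTime() - deadline >= 0) {
                return Outcome.TIMED_OUT;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return Outcome.INTERRUPTED;
            }
        }
    }
}
```

Returning on the first observed failure is a design choice for this sketch. 
A test that expects failed attempts to be retried would need a different 
policy, but it would still want the deadline so that a failure shows up as a 
clear assertion rather than as a harness timeout.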

> Rewrite TestQueueCapacities to make it simpler and avoid timeout errors
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-6064
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6064
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched, test
>    Affects Versions: 0.20.0
>            Reporter: Hemanth Yamijala
>
> We have seen TestQueueCapacities fail periodically, and a couple of times 
> fixes have only partially fixed the problem, the most recent instance 
> being HADOOP-5869. I found another instance of failure while running tests 
> locally to test a different patch. This was a different symptom from 
> the ones we've seen before. The core problem is that the test is too complex 
> and relies on too many things working correctly to be useful. It would make 
> sense to revisit the purpose of the test and see if a simpler model can serve 
> it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
