John Interrante created DAFFODIL-2751:
-----------------------------------------

             Summary: Occasional network timeout exceptions can hang a CI job now
                 Key: DAFFODIL-2751
                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2751
             Project: Daffodil
          Issue Type: Bug
          Components: Infrastructure
    Affects Versions: 3.5.0
            Reporter: John Interrante


Please see these 2 runs in GitHub Actions:

[Add Daffodil Developer Guide · apache/daffodil@9d114c3 
(github.com)|https://github.com/apache/daffodil/actions/runs/3464760904/jobs/5786683343]

[Add Daffodil Developer Guide · apache/daffodil@0bc99e6 
(github.com)|https://github.com/apache/daffodil/actions/runs/3475210535/jobs/5809186675]

One job in each run hung for 5 hours 54 minutes until GitHub Actions killed 
it.  Both jobs were running on the same runner (Java 8, Scala 2.12.17, 
ubuntu-20.04), and both failed in the following unit tests with the same 
error message:

org.apache.daffodil.io.TestInputSourceDataInputStream8.networkReadPartial1 

org.apache.daffodil.io.TestSocketPairTestRig.testHangDetection1

org.apache.daffodil.io.TestSocketPairTestRig.testHangDetection2

org.apache.daffodil.io.TestSocketPairTestRig.testSocketPairTestRig1

failed: java.util.concurrent.TimeoutException: Futures timed out after [1000 
milliseconds], took 1.002 sec

The rest of the jobs ran all of the unit tests successfully without any timeout 
exceptions.  We have seen an occasional timeout exception fail 1 out of 6 jobs 
in a run before, but it had never caused the job to hang (the job simply 
terminated after running the unit tests).

I do not think the GitHub Actions runner changed.  I checked the 
last CI job on the main branch ([Update sbt to 1.8.0 · apache/daffodil@6d4b2b6 
(github.com)|https://github.com/apache/daffodil/actions/runs/3462161126/jobs/5780684309])
 and the runner version numbers in the setup job details were the same.  Several 
CI jobs have run since the recent changes to the integration tests, so it seems 
unlikely that those changes caused the new hangs, even though a JVM can hang 
when non-daemon threads are still running.
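
For anyone unfamiliar with the non-daemon thread mechanism mentioned above, here 
is a minimal sketch of it in plain Java.  It is not taken from Daffodil or its 
test rig: the class HangSketch, its method names, and the latch standing in for 
a blocked socket read are all hypothetical, and it uses java.util.concurrent 
rather than the Scala futures that produced the error message.  It only shows 
that a timed-out wait leaves the worker thread running, and that a non-daemon 
worker must be stopped explicitly before the JVM can exit.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not code from Daffodil.
final class HangSketch {

    /** What one run of the sketch observed. */
    static final class Result {
        final boolean timedOut;                // the wait gave up
        final boolean workerAliveAfterTimeout; // the worker kept running anyway
        final boolean poolTerminated;          // explicit shutdown stopped it

        Result(boolean timedOut, boolean workerAliveAfterTimeout, boolean poolTerminated) {
            this.timedOut = timedOut;
            this.workerAliveAfterTimeout = workerAliveAfterTimeout;
            this.poolTerminated = poolTerminated;
        }
    }

    static Result run(long timeoutMillis) {
        final Thread[] worker = new Thread[1];
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "blocked-reader");
            t.setDaemon(false); // non-daemon: the JVM will not exit while it runs
            worker[0] = t;
            return t;
        });

        // Assumption: a latch that is never released stands in for a read that never returns.
        CountDownLatch neverReleased = new CountDownLatch(1);
        Future<Void> pending = pool.submit(() -> {
            neverReleased.await();
            return null;
        });

        boolean timedOut = false;
        boolean alive;
        try {
            try {
                pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                timedOut = true; // the caller stops waiting; the worker does not stop
            }
            alive = worker[0].isAlive();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            // Without this, the blocked non-daemon worker would keep the JVM alive.
            pool.shutdownNow(); // interrupts the worker, which unblocks await()
        }

        try {
            boolean terminated = pool.awaitTermination(5, TimeUnit.SECONDS);
            return new Result(timedOut, alive, terminated);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Result r = run(1000);
        System.out.println("timed out: " + r.timedOut);
        System.out.println("worker alive after timeout: " + r.workerAliveAfterTimeout);
        System.out.println("pool terminated after shutdownNow: " + r.poolTerminated);
    }
}
```

If the shutdownNow() call is removed, main() returns but the process does not 
exit, which is the kind of hang described above.  Whether that is what happened 
in these two jobs is not established here.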



--
This message was sent by Atlassian Jira
(v8.20.10#820010)