[ 
https://issues.apache.org/jira/browse/DAFFODIL-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634920#comment-17634920
 ] 

Mike Beckerle commented on DAFFODIL-2751:
-----------------------------------------

These tests really do involve asynchrony and race conditions, which is why 
they have timeouts. 

On very heavily loaded servers, these tests may simply not work, because it is 
almost impossible to anticipate the order of execution. Seemingly simple 
assumptions, such as that a newly spawned thread will start within 10 seconds, 
may not hold. 

We should revisit these tests. They probably just have race conditions that 
allow them to fail. 
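
As a rough illustration of one way to remove the "the thread will have started 
by now" assumption, here is a minimal sketch in Java. It is not taken from the 
Daffodil test rig; the class name LatchHandshake and the method 
runWithHandshake are hypothetical. The worker signals explicitly when it has 
started and when it has finished, and the timeout is kept only as a generous 
hang guard instead of being relied on for ordering.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class LatchHandshake {
    // Hypothetical helper, not part of Daffodil: runs a stand-in task on a new
    // thread and waits on explicit signals rather than on timing assumptions.
    static String runWithHandshake(long guardSeconds) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            started.countDown();   // signal: the thread is actually running
            result.set("done");    // stand-in for the real work
            finished.countDown();  // signal: the result is ready to read
        });
        worker.start();

        // The guard only bounds how long a genuinely stuck test can wait.
        // Correct ordering comes from the latches, not from the guard value.
        if (!started.await(guardSeconds, TimeUnit.SECONDS)) {
            throw new IllegalStateException("worker never started");
        }
        if (!finished.await(guardSeconds, TimeUnit.SECONDS)) {
            throw new IllegalStateException("worker never finished");
        }
        return result.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runWithHandshake(60));
    }
}
```

With this shape, a slow runner makes the test slower but does not change its 
outcome, and only a true hang trips the guard.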

> Occasional network timeout exceptions can hang a CI job now
> -----------------------------------------------------------
>
>                 Key: DAFFODIL-2751
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2751
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: 3.5.0
>            Reporter: John Interrante
>            Priority: Major
>
> Please see these 2 runs in GitHub Actions:
> [Add Daffodil Developer Guide · apache/daffodil@9d114c3 
> (github.com)|https://github.com/apache/daffodil/actions/runs/3464760904/jobs/5786683343]
> [Add Daffodil Developer Guide · apache/daffodil@0bc99e6 
> (github.com)|https://github.com/apache/daffodil/actions/runs/3475210535/jobs/5809186675]
> One job in each run hung for 5 hours 54 minutes, so GitHub Actions had to 
> kill the job.  Both jobs were running on the same runner (Java 8, Scala 
> 2.12.17, ubuntu-20.04) and had failed in the following unit tests with the 
> same error message:
> org.apache.daffodil.io.TestInputSourceDataInputStream8.networkReadPartial1 
> org.apache.daffodil.io.TestSocketPairTestRig.testHangDetection1
> org.apache.daffodil.io.TestSocketPairTestRig.testHangDetection2
> org.apache.daffodil.io.TestSocketPairTestRig.testSocketPairTestRig1
> failed: java.util.concurrent.TimeoutException: Futures timed out after [1000 
> milliseconds], took 1.002 sec
> The rest of the jobs ran all of the unit tests successfully without any 
> timeout exceptions.  Occasional timeout exceptions have failed 1 out of 6 
> jobs in a run before, but they had never caused the job to hang (the job had 
> simply terminated after running the unit tests).
> I do not think there was a change in the GitHub Actions runner.  I checked 
> the last CI job on the main branch ([Update sbt to 1.8.0 · 
> apache/daffodil@6d4b2b6 
> (github.com)|https://github.com/apache/daffodil/actions/runs/3462161126/jobs/5780684309])
>  and the runner version numbers were the same in the setup job details.  We 
> have had several CI jobs since the recent changes to the integration tests so 
> it seems unlikely they had anything to do with the new hangs, even though 
> hangs can happen due to non-daemon threads still running in a JVM.
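
On the point in the description about non-daemon threads keeping a JVM alive, 
the following is a minimal sketch in Java of an executor whose worker is a 
daemon thread, so a stuck task cannot by itself prevent the JVM from exiting. 
The names DaemonPool, newDaemonExecutor, and workerIsDaemon are hypothetical, 
and this is not a claim about how the Daffodil tests create their threads.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class DaemonPool {
    // Hypothetical helper: a single-thread executor backed by a daemon thread.
    static ExecutorService newDaemonExecutor(String name) {
        ThreadFactory factory = runnable -> {
            Thread t = new Thread(runnable, name);
            t.setDaemon(true); // daemon threads do not block JVM exit
            return t;
        };
        return Executors.newSingleThreadExecutor(factory);
    }

    // Runs a probe task on the pool and reports whether its thread is a daemon.
    static boolean workerIsDaemon() throws Exception {
        ExecutorService pool = newDaemonExecutor("test-io-worker");
        try {
            Callable<Boolean> probe = () -> Thread.currentThread().isDaemon();
            Future<Boolean> answer = pool.submit(probe);
            return answer.get(60, TimeUnit.SECONDS); // generous hang guard
        } finally {
            pool.shutdownNow(); // interrupt the worker and release the pool
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(workerIsDaemon());
    }
}
```

Daemon threads only address the "JVM never exits" symptom. A task blocked on a 
socket read would still need its own timeout or an explicit close to fail 
promptly.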



