[ 
https://issues.apache.org/jira/browse/DAFFODIL-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Lawrence resolved DAFFODIL-2751.
--------------------------------------
    Resolution: Fixed

Fixed in commits:

2fd1f1947094d8da3db3b34463ad7038a01c857a

Note that this removed timeouts, which could lead to a test hanging indefinitely.

One possible cause of a hang is a bug in either Daffodil or the ExpectIt 
library, but this is pretty unlikely for a number of reasons. If a CLI test 
does hang, the more likely cause is that the test body is broken, usually in a 
way where the CLI expects data from stdin and is blocked waiting for that 
data. Two likely examples of this:

1. The CLI test is parsing data and needs to hit EOF before it can finish, but 
the CLI test body never closes stdin. The CLI is stuck in a state of expecting 
more data but never getting it. In this case, the test body should use 
cli.closeInput() or set inputDone = true in a sendLine(...) function call.

2. The CLI is configured with the interactive debugger and has hit a 
breakpoint, so it is waiting for debugger commands, but the test body isn't 
providing any more commands. The fix in this case is to modify the test body 
to send additional debugger commands, such as "quit" to end debugging.
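
The first case can be illustrated outside of Daffodil with a minimal Python 
sketch. This is not the Daffodil CLI test API: the child process below is only 
a hypothetical stand-in for a CLI that reads stdin until EOF, and the name 
run_until_eof is made up for this example. The point is that the child cannot 
finish until the parent closes its end of stdin, and that a wait timeout is one 
way to keep a forgotten close from hanging a job.

```python
import subprocess
import sys

# Hypothetical stand-in for a CLI that parses stdin and only finishes at EOF.
CHILD = "import sys; data = sys.stdin.read(); print(len(data))"


def run_until_eof(payload: str, timeout_s: float = 10.0) -> str:
    """Send payload to the child, close stdin, and return the child's output."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdin.write(payload)
    # Closing stdin is what signals EOF. Without this line the child stays
    # blocked in read() and the wait below would hit the timeout.
    proc.stdin.close()
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Guard against a hang: kill the child instead of waiting forever.
        proc.kill()
        proc.wait()
        raise
    out = proc.stdout.read()
    proc.stdout.close()
    return out.strip()


if __name__ == "__main__":
    print(run_until_eof("hello\n"))
```

Removing the proc.stdin.close() call reproduces the hang described in case 1, 
with the timeout turning it into a test failure rather than a stuck job.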

> Occasional network timeout exceptions can hang a CI job now
> -----------------------------------------------------------
>
>                 Key: DAFFODIL-2751
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2751
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: 3.5.0
>            Reporter: John Interrante
>            Assignee: Steve Lawrence
>            Priority: Minor
>             Fix For: 3.5.0
>
>
> Please see these 2 runs in GitHub Actions:
> [Add Daffodil Developer Guide · apache/daffodil@9d114c3 
> (github.com)|https://github.com/apache/daffodil/actions/runs/3464760904/jobs/5786683343]
> [Add Daffodil Developer Guide · apache/daffodil@0bc99e6 
> (github.com)|https://github.com/apache/daffodil/actions/runs/3475210535/jobs/5809186675]
> One job in each run hung for 5 hours 54 minutes, so GitHub Actions had to 
> kill the job.  Both jobs were running on the same runner (Java 8, Scala 
> 2.12.17, ubuntu-20.04) and had failed in the following unit tests with the 
> same error message:
> org.apache.daffodil.io.TestInputSourceDataInputStream8.networkReadPartial1 
> org.apache.daffodil.io.TestSocketPairTestRig.testHangDetection1
> org.apache.daffodil.io.TestSocketPairTestRig.testHangDetection2
> org.apache.daffodil.io.TestSocketPairTestRig.testSocketPairTestRig1
> failed: java.util.concurrent.TimeoutException: Futures timed out after [1000 
> milliseconds], took 1.002 sec
> The rest of the jobs ran all of the unit tests successfully without any 
> timeout exceptions.  We have had an occasional timeout exception fail 1 out 
> of 6 jobs in a run before but they had not caused the job to hang before (the 
> job had simply terminated after running the unit tests).
> I do not think there was a change in the GitHub Actions runner.  I checked 
> the last CI job on the main branch ([Update sbt to 1.8.0 · 
> apache/daffodil@6d4b2b6 
> (github.com)|https://github.com/apache/daffodil/actions/runs/3462161126/jobs/5780684309])
>  and the runner version numbers were the same in the setup job details.  We 
> have had several CI jobs since the recent changes to the integration tests so 
> it seems unlikely they had anything to do with the new hangs, even though 
> hangs can happen due to non-daemon threads still running in a JVM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
