hudi-bot opened a new issue, #16221:
URL: https://github.com/apache/hudi/issues/16221

   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-6820
   - Type: Bug
   - Epic: https://issues.apache.org/jira/browse/HUDI-4302
   
   
   ---
   
   
   ## Comments
   
   06/Sep/23 17:17;ljain;Siva had tried disabling tests to see if they were 
causing timeouts via PRs -
   
   [https://github.com/apache/hudi/pull/9543]
   
   [https://github.com/apache/hudi/pull/9550/files]
   
   [https://github.com/apache/hudi/pull/9542]
   
   [https://github.com/apache/hudi/pull/9551]
   
   But the timeout issue was still visible.;;;
   
   ---
   
   06/Sep/23 17:17;ljain;For some runs, such as attempts 2 and 3 of 
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19384&view=logs&j=ba200224-5437-5e21-2643-114ac65587f4],
 the timeout occurs after 87 tests have run.
   
   For others, such as 
[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/19500/logs/7],
 the timeout occurs after 15 tests have run.;;;
   
   ---
   
   06/Sep/23 17:22;ljain;Some older runs where a similar issue can be seen:
   FT client/spark-client:
   
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18864&view=logs&j=7601efb9-4019-552e-11ba-eb31b66593b2|https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152&view=logs&j=dcedfe73-9485-5cc5-817a-73b61fc5dcb0]
 
   UT FT other modules
   
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152&view=logs&j=dcedfe73-9485-5cc5-817a-73b61fc5dcb0];;;
   
   ---
   
   06/Sep/23 17:26;ljain;Usually the error message is of the form:
   {code:java}
   ,##[error]We stopped hearing from agent Hosted Agent. Verify the agent 
machine is running and has a healthy network connection. Anything that 
terminates an agent process, starves it for CPU, or blocks its network access 
can cause this error. For more information, see: 
https://go.microsoft.com/fwlink/?linkid=846610 {code}
   Or of the form:
   {code:java}
   The job running on agent Azure Pipelines 7 ran longer than the maximum time 
of 150 minutes. {code}
   It could be an actual timeout where tests were running for 150 minutes. But 
in many cases what we are seeing in the raw logs is
   {code:java}
   2023-07-28T06:01:14.7898970Z         at 
java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_372]
   2023-07-28T06:01:14.7899483Z         at 
org.apache.hudi.metrics.Metrics.shutdown(Metrics.java:116) 
~[hudi-client-common-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]
   2023-07-28T06:01:14.7899862Z         at 
java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]
   2023-07-28T07:27:41.0195722Z ##[error]The operation was canceled.
   2023-07-28T07:27:41.0242402Z ##[section]Finishing: FT client/spark-client
    {code}
   There is a gap of more than an hour (about 1 hour 26 minutes in this log) 
between the last test output and the operation cancellation.;;;
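
A silent stretch like this can be located automatically by comparing consecutive timestamps in the raw log. The following is a minimal sketch, not part of the Hudi tooling: the helper names `parse_ts` and `find_gaps` and the 30-minute threshold are assumptions made for illustration, and the sample lines are the timestamped lines from the excerpt above, joined where the mail wrapped them.

```python
from datetime import datetime, timedelta

def parse_ts(line):
    """Return the leading timestamp of an Azure raw-log line, or None."""
    stripped = line.strip()
    token = stripped.split(" ", 1)[0] if stripped else ""
    if not token.endswith("Z") or "T" not in token:
        return None
    # Azure prints 7 fractional digits; datetime accepts at most 6, so truncate.
    head, _, frac = token[:-1].partition(".")
    try:
        return datetime.strptime(f"{head}.{frac[:6] or '0'}", "%Y-%m-%dT%H:%M:%S.%f")
    except ValueError:
        return None

def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Yield (line_before, line_after, gap) where the log was silent longer than threshold."""
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # wrapped continuation lines carry no timestamp
        if prev_ts is not None and ts - prev_ts > threshold:
            yield prev_line.strip(), line.strip(), ts - prev_ts
        prev_ts, prev_line = ts, line

sample = [
    "2023-07-28T06:01:14.7899862Z         at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]",
    "2023-07-28T07:27:41.0195722Z ##[error]The operation was canceled.",
    "2023-07-28T07:27:41.0242402Z ##[section]Finishing: FT client/spark-client",
]
gaps = list(find_gaps(sample))
for before, after, gap in gaps:
    print(f"{gap} of silence before: {after}")
```

Running this over a full raw log would list every stall longer than the threshold, which helps separate a genuinely slow test run from a hung agent.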
   
   ---
   
   06/Sep/23 17:29;ljain;Found a Microsoft Developer Community ticket for a 
similar issue where Azure CI is timing out. 
[https://developercommunity.visualstudio.com/t/errorthe-operation-was-canceled-azure-devops-build/692048?viewtype=all];;;
   
   ---
   
   06/Sep/23 17:30;ljain;Created [https://github.com/apache/hudi/pull/9627] 
and [https://github.com/apache/hudi/pull/9628] with 
{{TestHoodieRealtimeRecordReader}} disabled.
   
   UT FT timed out after 4 hours in both the PRs above.
   
   FT client/spark-client also timed out in the CI run
   
   
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19671&view=logs&j=7601efb9-4019-552e-11ba-eb31b66593b2&t=9688f101-287d-53f4-2a80-87202516f5d0];;;
   
   ---
   
   06/Sep/23 17:43;ljain;Created [https://github.com/apache/hudi/pull/9604], 
PRs 9605-9609, [https://github.com/apache/hudi/pull/9632] and 
[https://github.com/apache/hudi/pull/9633] for testing Azure CI on the Hudi 
0.13.0 branch.;;;
   
   ---
   
   07/Sep/23 13:19;ljain;The Maven version is 3.8.8 and has not changed since 
June, and we don't have Azure runs from before that.;;;
   
   ---
   
   07/Sep/23 18:21;ljain;First known occurrence of the timeout issue (12th June) - 
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17760&view=results];;;
   
   ---
   
   08/Sep/23 16:50;ljain;PRs [https://github.com/apache/hudi/pull/9604] to 9607 
were run on the 0.13.0 branch. Out of a total of 8 PRs, 4 failed with a timeout 
and the other 4 were cancelled because of other critical PRs. The issue is 
reproducible with 0.13.0.;;;
   
   ---
   
   08/Sep/23 16:53;ljain;Created a PR to move hudi-hadoop-mr and 
hudi-java-client to GitHub Actions. [https://github.com/apache/hudi/pull/9661];;;
   
   ---
   
   08/Sep/23 16:53;ljain;Created a PR which disables deltastreamer continuous 
mode tests. [https://github.com/apache/hudi/pull/9662];;;
   
   ---
   
   08/Sep/23 17:02;ljain;Also created PRs 
[https://github.com/apache/hudi/pull/9654/files] to 9657 for moving 
hudi-hadoop-mr and hudi-java-client to a separate job. Timeout errors are still 
seen in the UT FT jobs.;;;
   
   ---
   
   11/Sep/23 17:21;ljain;Created PRs https://github.com/apache/hudi/pull/9676 
to 9678 after disabling TestHoodieCombineHiveInputFormat. I was seeing flakiness 
in the GitHub check for the separated hudi-mr and java-client modules. Out of 
3 PRs, 2 passed but 1 will time out.;;;
   
   ---
   
   11/Sep/23 17:42;ljain;Created [https://github.com/apache/hudi/pull/9682] to 
add a 40-minute timeout for the hudi-hadoop-mr check. The GitHub check currently 
takes 6 hours to time out.;;;
   
   ---
   
   11/Sep/23 17:48;ljain;Created [https://github.com/apache/hudi/pull/9683] 
after enabling debug logs for Maven. Only the newly added GitHub check would 
run here. This should ideally print out more logs.;;;
   
   ---
   
   12/Sep/23 16:15;ljain;| Created PRs 
[https://github.com/apache/hudi/pull/9676] to 9678 after disabling 
TestHoodieCombineHiveInputFormat
   
   Usually I see a timeout in hudi-java-client after disabling this test. 
Otherwise we see timeouts in both hadoop-mr and java-client.;;;
   
   ---
   
   12/Sep/23 16:16;ljain;After enabling more debug logs, created a few PRs.
   
https://github.com/apache/hudi/actions/runs/6158416987/job/16711145162?pr=9690 .
   I can see that all methods annotated with @After complete before the timeout.;;;

