[ 
https://issues.apache.org/jira/browse/TEZ-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091080#comment-18091080
 ] 

Mahesh Raju Somalaraju edited comment on TEZ-4725 at 6/24/26 5:54 AM:
----------------------------------------------------------------------

After fixing the above two flaky tests, I observed that some other tests failed 
due to these changes. I am listing the details of all the failed tests here and 
pushing the fixes as part of this JIRA.

 

1) TestAMRecoveryAggregationBroadcast.testMapJoinTemporalFailure
Failure: expected: <1> but was:<0> in recovery log assertions
Root Cause: Fixed Thread.sleep(10s) before the AM kill; TABLE_SCAN/AGGREGATION 
may not finish within 10s on a loaded CI node

2) TestAMRecoveryAggregationBroadcast.*
Failure: IllegalArgumentException: port out of range:-1
Root Cause: New AM container allocated (state=RUNNING) but rpcPort == -1; the 
guard only checked for == 0

3) TestHistoryParser.testParserWithSuccessfulJob
Failure: JSONException: A JSONObject text must begin with '{'
Root Cause: ATS flush lag, plus ZipOutputStream not being closed on exception, 
left an empty zip entry for the parser to read

4) TestMRRJobsDAGApi.testMultipleMRRSleepJobViaSession
Failure: expected: <READY> but was:<RUNNING>
Root Cause: DAG SUCCEEDED ≠ AM session READY; the cleanup window is visible on 
slow CI

5) TestMRRJobsDAGApi.testNonDefaultFSStagingDir
Failure: test timed out after 60000ms in waitNonSessionTillReady
Root Cause: waitNonSessionTillReady() had no deadline; it looped forever when 
the AM port was not yet bound

6) TestMRRJobsDAGApi.testMRRSleepJobDagSubmit (and siblings)
Failure: test timed out after 60000ms
Root Cause: Same as 5); @Test(timeout=60000) fired before the new 120s internal 
deadline
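
To illustrate the idea behind items 2) and 5), here is a minimal, standalone 
sketch of a deadline-bounded wait that treats both 0 and -1 as "port not bound 
yet". It is not the actual Tez change: the class AmPortWait, the methods 
isUsablePort and waitForPort, and the stubbed port supplier are all 
hypothetical names made up for this example.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

public class AmPortWait {
    /** A port is usable only in 1..65535; both 0 and -1 mean "not bound yet". */
    static boolean isUsablePort(int port) {
        return port > 0 && port <= 65535;
    }

    /**
     * Polls portSupplier until it reports a usable port or timeoutMs elapses.
     * Returns the port, or -1 if the deadline passed or the thread was interrupted.
     */
    static int waitForPort(IntSupplier portSupplier, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            int port = portSupplier.getAsInt();
            if (isUsablePort(port)) {
                return port;
            }
            if (System.nanoTime() - deadline >= 0) {
                return -1; // give up instead of looping forever
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return -1;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical AM stub: reports -1, then 0, then a real port.
        IntSupplier fakeAm = () -> {
            int n = calls[0]++;
            return n == 0 ? -1 : (n == 1 ? 0 : 12345);
        };
        System.out.println(waitForPort(fakeAm, 2000, 10));  // prints 12345
        System.out.println(waitForPort(() -> -1, 100, 10)); // prints -1
    }
}
```

Whatever internal deadline is chosen has to stay below the surrounding 
@Test(timeout=...) value, otherwise JUnit aborts the test first, as in item 6).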

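For item 3), the sketch below shows the general pattern of closing a 
ZipOutputStream with try-with-resources so the archive is finished even when a 
write throws. It is an illustration under assumptions, not the TestHistoryParser 
or ATS code: the class ZipCloseSketch and its methods writeZip and 
readFirstEntry are hypothetical, and the data lives in memory rather than on 
disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipCloseSketch {
    /** Writes one entry; try-with-resources closes the stream on success and on exception. */
    static byte[] writeZip(String entryName, String json) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(json.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the first entry back as text; returns "" if the archive has no entries. */
    static String readFirstEntry(byte[] zipBytes) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            if (in.getNextEntry() == null) {
                return "";
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] zip = writeZip("dag.json", "{\"dag\":\"ok\"}");
        String text = readFirstEntry(zip);
        // Guard before parsing: an empty entry is what triggers the JSONException.
        System.out.println(text.startsWith("{") ? "parseable" : "empty or invalid");
    }
}
```

A reader that checks for empty content before handing it to the JSON parser 
can then retry or fail with a clearer message when the data has not been 
flushed yet.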

> TestAMRecoveryAggregationBroadcast.testMapJoinTemporalFailure is flaky
> ----------------------------------------------------------------------
>
>                 Key: TEZ-4725
>                 URL: https://issues.apache.org/jira/browse/TEZ-4725
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: Mahesh Raju Somalaraju
>            Priority: Major
>         Attachments: 
> TEST-org.apache.tez.test.TestAMRecoveryAggregationBroadcast.xml, 
> org.apache.tez.test.TestAMRecoveryAggregationBroadcast-output.txt, 
> patch-unit-root.txt
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> {code}
> [ERROR] Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 114.8 s <<< FAILURE! -- in org.apache.tez.test.TestAMRecoveryAggregationBroadcast
> [ERROR] org.apache.tez.test.TestAMRecoveryAggregationBroadcast.testMapJoinTemporalFailure -- Time elapsed: 15.51 s <<< ERROR!
> java.lang.IllegalArgumentException: port out of range:-1
>       at java.base/java.net.InetSocketAddress.checkPort(InetSocketAddress.java:153)
>       at java.base/java.net.InetSocketAddress.<init>(InetSocketAddress.java:198)
>       at org.apache.hadoop.net.NetUtils.createSocketAddrForHost(NetUtils.java:320)
>       at org.apache.tez.client.TezClientUtils.getAMProxy(TezClientUtils.java:973)
>       at org.apache.tez.dag.api.client.rpc.DAGClientRPCImpl.createAMProxyIfNeeded(DAGClientRPCImpl.java:290)
>       at org.apache.tez.dag.api.client.rpc.DAGClientRPCImpl.getDAGStatus(DAGClientRPCImpl.java:101)
>       at org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatusViaAM(DAGClientImpl.java:406)
>       at org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatusInternal(DAGClientImpl.java:247)
>       at org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatus(DAGClientImpl.java:297)
>       at org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatus(DAGClientImpl.java:169)
>       at org.apache.tez.dag.api.client.DAGClientImpl._waitForCompletionWithStatusUpdates(DAGClientImpl.java:573)
>       at org.apache.tez.dag.api.client.DAGClientImpl.waitForCompletionWithStatusUpdates(DAGClientImpl.java:384)
>       at org.apache.tez.test.TestAMRecoveryAggregationBroadcast.runDAGAndVerify(TestAMRecoveryAggregationBroadcast.java:351)
>       at org.apache.tez.test.TestAMRecoveryAggregationBroadcast.testMapJoinTemporalFailure(TestAMRecoveryAggregationBroadcast.java:262)
>       at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:580)
>       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>       at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
>       at java.base/java.lang.Thread.run(Thread.java:1583)
> {code}
> Without looking into this in detail, a 10-second sleep followed by an AM kill 
> seems quite error-prone; maybe there is room for improvement here:
> https://github.com/apache/tez/blob/cba56893fe4b759c08a1cd9ce7d1434b6818795f/tez-tests/src/test/java/org/apache/tez/test/TestAMRecoveryAggregationBroadcast.java#L343



--
This message was sent by Atlassian Jira
(v8.20.10#820010)