[
https://issues.apache.org/jira/browse/TEZ-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091080#comment-18091080
]
Mahesh Raju Somalaraju commented on TEZ-4725:
---------------------------------------------
After fixing the above two flaky tests, I observed that some other tests failed
due to these changes. I am listing the details of all the failed tests here and
pushing the fixes as part of this JIRA.
1) TestAMRecoveryAggregationBroadcast.testMapJoinTemporalFailure
Failure: expected:<1> but was:<0> in recovery log assertions
Root Cause: A fixed Thread.sleep(10s) ran before the AM kill; TABLE_SCAN/AGGREGATION
may not finish in 10s on a loaded CI node
2) TestAMRecoveryAggregationBroadcast.*
Failure: IllegalArgumentException: port out of range:-1
Root Cause: New AM container allocated (state=RUNNING) but rpcPort == -1; the
guard only checked for == 0
3) TestHistoryParser.testParserWithSuccessfulJob
Failure: JSONException: A JSONObject text must begin with '{'
Root Cause: ATS flush lag + ZipOutputStream not closed on exception = empty zip
entry read by parser
4) TestMRRJobsDAGApi.testMultipleMRRSleepJobViaSession
Failure: expected:<READY> but was:<RUNNING>
Root Cause: DAG SUCCEEDED ≠ AM session READY; cleanup window visible on slow CI
5) TestMRRJobsDAGApi.testNonDefaultFSStagingDir
Failure: test timed out after 60000ms in waitNonSessionTillReady
Root Cause: waitNonSessionTillReady() had no deadline; it looped forever when the
AM port was not yet bound
6) TestMRRJobsDAGApi.testMRRSleepJobDagSubmit (and siblings)
Failure: test timed out after 60000ms
Root Cause: Same as 5); @Test(timeout=60000) fired before the new 120s internal
deadline
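
Items 2 and 5 share one pattern: a poll loop that treats any non-positive port
as "not bound yet" and gives up at a deadline. Below is a minimal sketch of that
idea, not the actual patch. BoundedWait, isUsablePort and waitForPort are
hypothetical names, and the IntSupplier is an assumed stand-in for whatever
reports the AM's RPC port in the real tests.

```java
import java.util.function.IntSupplier;

public class BoundedWait {

    /** Both 0 and -1 mean "not bound yet"; only 1..65535 is a usable port. */
    static boolean isUsablePort(int port) {
        return port > 0 && port <= 65535;
    }

    /**
     * Polls portSupplier until it reports a usable port or timeoutMs elapses.
     * Returns the port, or -1 if the deadline passed or the thread was interrupted.
     */
    static int waitForPort(IntSupplier portSupplier, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            int port = portSupplier.getAsInt();
            if (isUsablePort(port)) {
                return port;
            }
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                return -1;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return -1;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated AM (hypothetical): reports -1 twice, then 0, then a real port.
        int[] reported = {-1, -1, 0, 12345};
        int[] idx = {0};
        IntSupplier fakeAm = () -> reported[Math.min(idx[0]++, reported.length - 1)];
        System.out.println(waitForPort(fakeAm, 2_000, 10)); // prints 12345
        System.out.println(waitForPort(() -> -1, 50, 10));  // prints -1
    }
}
```

Whatever value is chosen for the internal deadline, it has to be shorter than the
enclosing @Test timeout; otherwise JUnit fires first, as in item 6.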
> TestAMRecoveryAggregationBroadcast.testMapJoinTemporalFailure is flaky
> ----------------------------------------------------------------------
>
> Key: TEZ-4725
> URL: https://issues.apache.org/jira/browse/TEZ-4725
> Project: Apache Tez
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: Mahesh Raju Somalaraju
> Priority: Major
> Attachments:
> TEST-org.apache.tez.test.TestAMRecoveryAggregationBroadcast.xml,
> org.apache.tez.test.TestAMRecoveryAggregationBroadcast-output.txt,
> patch-unit-root.txt
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> {code}
> [ERROR] Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 114.8
> s <<< FAILURE! -- in org.apache.tez.test.TestAMRecoveryAggregationBroadcast
> [ERROR]
> org.apache.tez.test.TestAMRecoveryAggregationBroadcast.testMapJoinTemporalFailure
> -- Time elapsed: 15.51 s <<< ERROR!
> java.lang.IllegalArgumentException: port out of range:-1
> at
> java.base/java.net.InetSocketAddress.checkPort(InetSocketAddress.java:153)
> at
> java.base/java.net.InetSocketAddress.<init>(InetSocketAddress.java:198)
> at
> org.apache.hadoop.net.NetUtils.createSocketAddrForHost(NetUtils.java:320)
> at
> org.apache.tez.client.TezClientUtils.getAMProxy(TezClientUtils.java:973)
> at
> org.apache.tez.dag.api.client.rpc.DAGClientRPCImpl.createAMProxyIfNeeded(DAGClientRPCImpl.java:290)
> at
> org.apache.tez.dag.api.client.rpc.DAGClientRPCImpl.getDAGStatus(DAGClientRPCImpl.java:101)
> at
> org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatusViaAM(DAGClientImpl.java:406)
> at
> org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatusInternal(DAGClientImpl.java:247)
> at
> org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatus(DAGClientImpl.java:297)
> at
> org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatus(DAGClientImpl.java:169)
> at
> org.apache.tez.dag.api.client.DAGClientImpl._waitForCompletionWithStatusUpdates(DAGClientImpl.java:573)
> at
> org.apache.tez.dag.api.client.DAGClientImpl.waitForCompletionWithStatusUpdates(DAGClientImpl.java:384)
> at
> org.apache.tez.test.TestAMRecoveryAggregationBroadcast.runDAGAndVerify(TestAMRecoveryAggregationBroadcast.java:351)
> at
> org.apache.tez.test.TestAMRecoveryAggregationBroadcast.testMapJoinTemporalFailure(TestAMRecoveryAggregationBroadcast.java:262)
> at
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> at java.base/java.lang.reflect.Method.invoke(Method.java:580)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> {code}
> Without looking into this in detail, a 10-second sleep followed by an AM kill
> seems quite error-prone; maybe there is room for improvement here:
> https://github.com/apache/tez/blob/cba56893fe4b759c08a1cd9ce7d1434b6818795f/tez-tests/src/test/java/org/apache/tez/test/TestAMRecoveryAggregationBroadcast.java#L343
--
This message was sent by Atlassian Jira
(v8.20.10#820010)