[ 
https://issues.apache.org/jira/browse/TEZ-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191011#comment-17191011
 ] 

László Bodor edited comment on TEZ-4230 at 9/5/20, 10:11 AM:
-------------------------------------------------------------

I think this is caused by TEZ-3897, which seems to introduce a race condition via 
[future.cancel(true)|https://github.com/apache/tez/commit/c34e46c73218bf21a0219f3004e20cbedaad92f4#diff-a1849ff607725cf1b84d74e78823ca3cR305]

In the Hive tests mentioned above, we can see hangs with both 0.9.2 and 0.10.0 
(staging artifact), and the issue now seems clear to me based on 
[^TestCrudCompactorOnTez.log]

Somehow the task's heartbeat thread is interrupted while the AsyncDispatcher is 
handling the event, and the last log message before "AsyncDispatcher thread 
interrupted" is "Stopping containerId", so I suspect that 
LocalContainerLauncher cancels the task runnable and doesn't wait for the 
heartbeat to be fully processed.

cc: [~jeagles], [~jlowe], I'm wondering if this makes sense to you.

Before TEZ-3897, LocalContainerLauncher completely ignored the task callback 
on container stop; after TEZ-3897, "future.cancel(true)" seems to be quite 
strict under some circumstances. I'm about to test the flaky Hive test with 
[future.cancel(false)|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Future.html?is-external=true#cancel-boolean-]

UPDATE: testing with future.cancel(false) is in progress:
http://ci.hive.apache.org/job/hive-flaky-check/103/console
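
To illustrate the difference between the two cancel modes discussed above, 
here is a minimal standalone sketch. It is not Tez code and does not reproduce 
LocalContainerLauncher or the AsyncDispatcher: the class name CancelDemo, the 
method heartbeatCompleted and the latch-based "heartbeat" are all hypothetical 
stand-ins. It only shows that cancel(true) interrupts a task that is already 
running, while cancel(false) lets that task finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

class CancelDemo {

  /**
   * Starts a hypothetical "heartbeat" task, cancels it while it is running,
   * and returns true if the task was still able to finish its work.
   */
  static boolean heartbeatCompleted(boolean mayInterruptIfRunning)
      throws InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);   // task is running
    CountDownLatch proceed = new CountDownLatch(1);   // task may continue
    CountDownLatch finished = new CountDownLatch(1);  // task has exited
    AtomicBoolean completed = new AtomicBoolean(false);

    Future<?> future = executor.submit(() -> {
      started.countDown();
      try {
        // Stands in for a blocking step of heartbeat processing.
        proceed.await();
        // Honor an interrupt that arrived just before the wait returned.
        if (Thread.interrupted()) {
          throw new InterruptedException();
        }
        completed.set(true);
      } catch (InterruptedException e) {
        // Reached only when cancel(true) interrupted the worker thread.
      } finally {
        finished.countDown();
      }
    });

    started.await();                       // make sure the task is mid-flight
    future.cancel(mayInterruptIfRunning);  // the call under discussion
    proceed.countDown();                   // let an uninterrupted task go on
    finished.await();                      // wait until the task has exited
    executor.shutdown();
    return completed.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("cancel(true)  -> completed: " + heartbeatCompleted(true));
    System.out.println("cancel(false) -> completed: " + heartbeatCompleted(false));
  }
}
```

In this sketch, the cancel(true) run reports completed: false and the 
cancel(false) run reports completed: true. Note that in both cases the Future 
itself is marked as cancelled; the only difference is whether the already 
running worker thread gets interrupted.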



> TestMmCompactorOnTez/TestCrudCompactorOnTez hangs when running against Tez 
> 0.10.0 staging artifact
> --------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4230
>                 URL: https://issues.apache.org/jira/browse/TEZ-4230
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TestCrudCompactorOnTez.log, TestCrudCompactorOnTez2.log, 
> jstack.log, 
> org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez-output.txt
>
>
> Reproduced the issue in a ptest run which I set up to run against the Tez 
> staging artifacts 
> (https://repository.apache.org/content/repositories/orgapachetez-1068/)
> http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1311/14/pipeline/417
> I'm about to investigate this. I think Tez 0.10.0 cannot be released until we 
> confirm whether it's a Hive or Tez bug.
> {code}
> mvn test -Pitests,hadoop-2 -Dtest=TestMmCompactorOnTez -pl ./itests/hive-unit
> {code}
> Tez setup:
> https://github.com/apache/hive/commit/92516631ab39f39df5d0692f98ac32c2cd320997#diff-a22bcc9ba13b310c7abfee4a57c4b130R83-R97


