[
https://issues.apache.org/jira/browse/TEZ-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191011#comment-17191011
]
László Bodor edited comment on TEZ-4230 at 9/5/20, 4:44 PM:
------------------------------------------------------------
I think this is caused by TEZ-3897, which seems to have introduced a race
condition through
[future.cancel(true)|https://github.com/apache/tez/commit/c34e46c73218bf21a0219f3004e20cbedaad92f4#diff-a1849ff607725cf1b84d74e78823ca3cR305].
In the hive tests mentioned above, we can see hangs with tez 0.9.2 and 0.10.0
(staging artifact), and the issue now seems clear to me based on
[^TestCrudCompactorOnTez.log]:
somehow the task's heartbeat thread is interrupted while the AsyncDispatcher is
handling the event, and the last log message before "AsyncDispatcher thread
interrupted" is "Stopping containerId", so I suspect that
LocalContainerLauncher cancels the task runnable and doesn't wait for the
heartbeat to be processed fully.

cc: [~jeagles], [~jlowe] wondering if this makes sense to you. Before
TEZ-3897, LocalContainerLauncher totally ignored the task callback on container
stop; after TEZ-3897, "future.cancel(true)" seems to be quite strict under some
circumstances. I'm about to test the flaky hive test with
[future.cancel(false)|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Future.html?is-external=true#cancel-boolean-].

UPDATE: testing with future.cancel(false) is in progress (built a 0.9.2.1
artifact and
[deployed|https://repository.apache.org/content/repositories/orgapachetez-1069/])
[http://ci.hive.apache.org/job/hive-flaky-check/103/console]
[http://ci.hive.apache.org/job/hive-flaky-check/104/console]
[http://ci.hive.apache.org/job/hive-flaky-check/105/console]
[http://ci.hive.apache.org/job/hive-flaky-check/106/console]
[http://ci.hive.apache.org/job/hive-flaky-check/107/console]
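
For reference, the difference between the two cancel modes can be shown with a minimal standalone sketch. This is not Tez code: the class name CancelDemo, the method wasInterrupted and the sleeping task are hypothetical stand-ins for the task runnable and its in-flight heartbeat. It only relies on the documented java.util.concurrent behaviour: cancel(true) interrupts the thread running the task, while cancel(false) marks the future cancelled and lets the running task finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelDemo {

  /**
   * Submits a task that blocks for a while (hypothetical stand-in for a task
   * with a heartbeat in flight), cancels it once it is running, and reports
   * whether the task thread observed an interrupt.
   */
  public static boolean wasInterrupted(boolean mayInterruptIfRunning) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch finished = new CountDownLatch(1);
    AtomicBoolean interrupted = new AtomicBoolean(false);

    Future<?> future = executor.submit(() -> {
      started.countDown();
      try {
        Thread.sleep(300); // simulated in-flight work
      } catch (InterruptedException e) {
        interrupted.set(true); // only reached if the thread was interrupted
      } finally {
        finished.countDown();
      }
    });

    try {
      started.await();                      // ensure the task is already running
      future.cancel(mayInterruptIfRunning); // true: interrupt, false: no interrupt
      finished.await(5, TimeUnit.SECONDS);  // wait until the task body has ended
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("demo thread interrupted", e);
    } finally {
      executor.shutdown();
    }
    return interrupted.get();
  }

  public static void main(String[] args) {
    System.out.println("cancel(true)  interrupted the task: " + wasInterrupted(true));
    System.out.println("cancel(false) interrupted the task: " + wasInterrupted(false));
  }
}
```

In both cases future.isCancelled() becomes true; the only difference is whether the running thread gets interrupted, which is the behaviour suspected above.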
> TestMmCompactorOnTez/TestCrudCompactorOnTez hangs when running against Tez
> 0.10.0 staging artifact
> --------------------------------------------------------------------------------------------------
>
> Key: TEZ-4230
> URL: https://issues.apache.org/jira/browse/TEZ-4230
> Project: Apache Tez
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: TEZ-4230.01.patch, TestCrudCompactorOnTez.log,
> TestCrudCompactorOnTez2.log, jstack.log,
> org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez-output.txt
>
>
> Reproduced the issue in a ptest run that I set up against the tez staging
> artifacts
> (https://repository.apache.org/content/repositories/orgapachetez-1068/)
> http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1311/14/pipeline/417
> I'm about to investigate this. I think Tez 0.10.0 cannot be released until we
> confirm whether it's a hive or tez bug.
> {code}
> mvn test -Pitests,hadoop-2 -Dtest=TestMmCompactorOnTez -pl ./itests/hive-unit
> {code}
> tez setup:
> https://github.com/apache/hive/commit/92516631ab39f39df5d0692f98ac32c2cd320997#diff-a22bcc9ba13b310c7abfee4a57c4b130R83-R97