Urmas Tamassy created TEZ-4646:
----------------------------------
Summary: Periodic jstack collection for tez
(tez.thread.dump.interval) only collects jstacks once.
Key: TEZ-4646
URL: https://issues.apache.org/jira/browse/TEZ-4646
Project: Apache Tez
Issue Type: Bug
Reporter: Urmas Tamassy
*Issue description:*
https://issues.apache.org/jira/browse/TEZ-4344 intends to allow users to
configure periodic jstack collection for Tez AM (DAG) and executor (task)
containers. Unfortunately, the current implementation only performs a single
jstack collection, after an initial delay set via the tez.thread.dump.interval
configuration.
The cause appears to be the use of the ScheduledExecutorService schedule
method, which runs a task only once after the given delay. For periodic
execution, scheduleAtFixedRate or scheduleWithFixedDelay seems more appropriate.
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#schedule-java.lang.Runnable-long-java.util.concurrent.TimeUnit-
https://github.com/apache/tez/blob/c981a9459c48b5e0c49fb4197173a910dcf7a39a/tez-runtime-internals/src/main/java/org/apache/tez/runtime/TezThreadDumpHelper.java#L94
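To illustrate the difference between the scheduling methods named above, here
is a minimal standalone sketch. It is not the Tez code: the class ScheduleDemo,
the method countRuns, and the counter standing in for a thread dump are all
hypothetical, and the timings are scaled down to milliseconds.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of Tez.
public class ScheduleDemo {

    // Schedules a counter increment (a stand-in for taking a thread dump)
    // either once or periodically, waits windowMs, and returns how many
    // times the task ran within that window.
    static int countRuns(boolean periodic, long initialDelayMs,
                         long intervalMs, long windowMs) {
        ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        Runnable dump = runs::incrementAndGet;
        try {
            if (periodic) {
                // Runs after initialDelayMs, then again intervalMs after
                // each run completes, until the executor is shut down.
                executor.scheduleWithFixedDelay(
                        dump, initialDelayMs, intervalMs, TimeUnit.MILLISECONDS);
            } else {
                // One-shot: runs exactly once after initialDelayMs.
                executor.schedule(dump, initialDelayMs, TimeUnit.MILLISECONDS);
            }
            Thread.sleep(windowMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        // Scaled-down analogue of a 3s interval observed over a 10s task.
        System.out.println("schedule: " + countRuns(false, 300, 300, 1000));
        System.out.println("scheduleWithFixedDelay: "
                + countRuns(true, 300, 300, 1000));
    }
}
```

The one-shot variant reports a single run no matter how long the window is,
which matches the single jstack observed here. The periodic variant keeps
firing until the executor is shut down. Taking the initial delay as a separate
parameter also shows how it could be made configurable independently of the
interval.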
*Reproduction* (tested on CDP 7.1.9 SP1, but based on the code it appears to
affect all releases):
set hive.fetch.task.conversion=none;
-- set hive.security.authorization.sqlstd.confwhitelist.append to
-- tez\.thread\.dump\.interval ahead of time
set tez.thread.dump.interval=3s;
-- also set hive.server2.builtin.udf.blacklist to a dummy value ahead of time
-- to allow the reflect UDF (java_method)
select java_method("java.lang.Thread","sleep",10000L);
With a 10s duration we would expect 2-3 jstacks (depending on initial delay),
but we only receive 1 after 3 seconds. Log snippets:
Container: container_e06_1756713041685_0004_01_000002 on
ccycloud-3.tamassyurmas.root.comops.site_8041
LogAggregationType: AGGREGATED
======================================================================================================
LogType:syslog_attempt_1756713041685_0004_1_00_000000_0
...
2025-09-01 10:04:42,948 [INFO] [main] |runtime.TezThreadDumpHelper|: Periodic
Thread Dump Capture Service Configured to capture Thread Dumps at 3000 ms
frequency and at path:
/var/log/hadoop-yarn/container/application_1756713041685_0004/container_e06_1756713041685_0004_01_000002
LogType:attempt_1756713041685_0004_1_00_000000_0_1756721085950.jstack
1756721085950 = Mon Sep 1 10:04:45 UTC 2025, which confirms a stack dump after
3 seconds, but no others are observed for the same task attempt or for the DAG
(which also has only a single dump). Attaching the app logs of the same.
*Expectation/severity:*
The feature should allow periodic collection rather than a single collection
after an initial delay. Please ensure the feature works as expected. The
initial delay might also need to be configurable.
The feature simplifies the investigation of long-running or stuck Tez
applications where jstacks of specific containers (YARN) or K8s
executor/coordinator pod processes (CDW) may be necessary. Manual collection
may be difficult or near-impossible in certain situations, so this is a
valuable diagnostic feature.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)