Urmas Tamassy created TEZ-4646:
----------------------------------

             Summary: Periodic jstack collection for tez 
(tez.thread.dump.interval) only collects jstacks once.
                 Key: TEZ-4646
                 URL: https://issues.apache.org/jira/browse/TEZ-4646
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Urmas Tamassy


*Issue description:*

https://issues.apache.org/jira/browse/TEZ-4344 intends to allow users to configure 
periodic jstack collection for Tez AM (DAG) and executor (task) containers.

Unfortunately the current implementation only allows a single jstack collection 
after an initial delay via the tez.thread.dump.interval configuration.

The cause appears to be the use of the ScheduledExecutorService schedule 
method, which runs a task only once after the given delay. For repeated 
execution, scheduleAtFixedRate or scheduleWithFixedDelay would be more 
appropriate.

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#schedule-java.lang.Runnable-long-java.util.concurrent.TimeUnit-

https://github.com/apache/tez/blob/c981a9459c48b5e0c49fb4197173a910dcf7a39a/tez-runtime-internals/src/main/java/org/apache/tez/runtime/TezThreadDumpHelper.java#L94
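
To illustrate the difference described above, here is a minimal standalone sketch. It is not the TezThreadDumpHelper code: the class name, the countRuns helper and the counter standing in for a thread dump are all hypothetical. It only contrasts the one-shot schedule call with scheduleWithFixedDelay on a plain ScheduledExecutorService.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PeriodicDumpSketch {

    // Counts how many times a stand-in "dump" task runs within observeMs.
    // Hypothetical helper for illustration; not part of Tez.
    static int countRuns(boolean periodic, long intervalMs, long observeMs)
            throws InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        Runnable dump = runs::incrementAndGet; // placeholder for a real thread dump

        if (periodic) {
            // Runs after intervalMs, then again intervalMs after each run completes.
            executor.scheduleWithFixedDelay(dump, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
        } else {
            // Runs exactly once after intervalMs; nothing reschedules it.
            executor.schedule(dump, intervalMs, TimeUnit.MILLISECONDS);
        }

        Thread.sleep(observeMs);
        executor.shutdownNow();
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("schedule: " + countRuns(false, 100, 1000));
        System.out.println("scheduleWithFixedDelay: " + countRuns(true, 100, 1000));
    }
}
```

The one-shot variant reports a single run regardless of how long the observation window is, which matches the single jstack seen in the reproduction below. Whether scheduleAtFixedRate or scheduleWithFixedDelay is the better fit for the actual fix is a design choice left to the implementer; fixed delay avoids back-to-back dumps if one dump takes longer than the interval.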

*Reproduction* (tested on CDP7.1.9SP1, but based on code it appears to affect 
all releases):

set hive.fetch.task.conversion=none;
-- set hive.security.authorization.sqlstd.confwhitelist.append to tez\.thread\.dump\.interval ahead of time
set tez.thread.dump.interval=3s;
-- also set hive.server2.builtin.udf.blacklist to a dummy value ahead of time to allow the reflect UDFs
select java_method("java.lang.Thread","sleep",10000L);

With a 10s duration we would expect 2-3 jstacks (depending on initial delay), 
but we only receive 1 after 3 seconds. Log snippets:

Container: container_e06_1756713041685_0004_01_000002 on 
ccycloud-3.tamassyurmas.root.comops.site_8041
LogAggregationType: AGGREGATED
======================================================================================================
LogType:syslog_attempt_1756713041685_0004_1_00_000000_0
...
2025-09-01 10:04:42,948 [INFO] [main] |runtime.TezThreadDumpHelper|: Periodic 
Thread Dump Capture Service Configured to capture Thread Dumps at 3000 ms 
frequency and at path: 
/var/log/hadoop-yarn/container/application_1756713041685_0004/container_e06_1756713041685_0004_01_000002

LogType:attempt_1756713041685_0004_1_00_000000_0_1756721085950.jstack

1756721085950 = Mon Sep 1 10:04:45 UTC 2025, which confirms a stack dump after 
3 seconds, but no others are observed for the same task attempt or for the DAG 
(which also has only a single dump). Attaching the app logs for this run.
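
The timestamp arithmetic above can be checked with a short sketch. The class and method names are hypothetical, and it assumes the syslog timestamp (10:04:42,948) is in UTC, which the log snippet itself does not state.

```java
import java.time.Duration;
import java.time.Instant;

public class DumpTimestampCheck {

    // Converts the epoch-millisecond suffix of a .jstack log name to a UTC instant.
    static Instant dumpInstant(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    public static void main(String[] args) {
        Instant dump = dumpInstant(1756721085950L);
        // Assumption: the syslog line's timestamp is UTC.
        Instant configured = Instant.parse("2025-09-01T10:04:42.948Z");

        System.out.println(dump); // 2025-09-01T10:04:45.950Z
        System.out.println(Duration.between(configured, dump).toMillis() + " ms"); // 3002 ms
    }
}
```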

*Expectation/severity:*

The feature should collect jstacks periodically rather than only once after an 
initial delay. Please ensure it works as intended. The initial delay might 
also need to be configurable.

The feature simplifies the investigation of long-running or stuck tez 
applications where jstacks of specific containers (Yarn) or K8s 
executor/coordinator pod processes (CDW) may be necessary. Manual collection 
may be difficult or near-impossible in certain situations and as such it is a 
valuable diagnostic feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
