[ https://issues.apache.org/jira/browse/TEZ-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017627#comment-18017627 ]
Ayush Saxena commented on TEZ-4646:
-----------------------------------

Committed to master.

Thanks [~abstractdog] for the review!!!

> Periodic jstack collection for tez (tez.thread.dump.interval) only collects jstacks once.
> -----------------------------------------------------------------------------------------
>
>                 Key: TEZ-4646
>                 URL: https://issues.apache.org/jira/browse/TEZ-4646
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Urmas Tamassy
>            Assignee: Ayush Saxena
>            Priority: Major
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> *Issue description:*
> https://issues.apache.org/jira/browse/TEZ-4344 intends to allow users to
> configure periodic jstack collection for Tez AM (DAG) and executor (task)
> containers.
> Unfortunately, the current implementation only allows a single jstack
> collection after an initial delay via the tez.thread.dump.interval
> configuration.
> The issue seems to be due to improper use of the ScheduledExecutorService
> schedule method, where scheduleAtFixedRate or scheduleWithFixedDelay seems
> more appropriate.
> [https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#schedule-java.lang.Runnable-long-java.util.concurrent.TimeUnit-]
> [https://github.com/apache/tez/blob/c981a9459c48b5e0c49fb4197173a910dcf7a39a/tez-runtime-internals/src/main/java/org/apache/tez/runtime/TezThreadDumpHelper.java#L94]
>
> *Reproduction* (tested on CDP7.1.9SP1, but based on the code it appears to
> affect all releases):
> {code:java}
> set hive.fetch.task.conversion=none;
> -- set hive.security.authorization.sqlstd.confwhitelist.append to tez\.thread\.dump\.interval ahead of time
> set tez.thread.dump.interval=3s;
> -- also set hive.server2.builtin.udf.blacklist to a dummy value ahead of time to allow reflects
> select java_method("java.lang.Thread","sleep",10000L);
> {code}
> With a 10s duration we would expect 2-3 jstacks (depending on the initial
> delay), but we only receive 1, after 3 seconds.
> Log snippets:
> {code:java}
> Container: container_e06_1756713041685_0004_01_000002 on ccycloud-3.tamassyurmas.root.comops.site_8041
> LogAggregationType: AGGREGATED
> ======================================================================================================
> LogType:syslog_attempt_1756713041685_0004_1_00_000000_0
> ...
> 2025-09-01 10:04:42,948 [INFO] [main] |runtime.TezThreadDumpHelper|: Periodic Thread Dump Capture Service Configured to capture Thread Dumps at 3000 ms frequency and at path: /var/log/hadoop-yarn/container/application_1756713041685_0004/container_e06_1756713041685_0004_01_000002
> {code}
> {code:java}
> LogType:attempt_1756713041685_0004_1_00_000000_0_1756721085950.jstack
> {code}
> 1756721085950 = Mon Sep 1 10:04:45 UTC 2025, which confirms a stack dump after
> 3 seconds, but no others are observed for the same task attempt or for the DAG
> (also only a single dump). Attaching the app logs of the same.
>
> *Expectation/severity:*
> The feature should allow periodic collection rather than a single collection
> after an initial delay; please ensure the feature works as expected. The
> initial delay might also need to be configurable.
> The feature simplifies the investigation of long-running or stuck Tez
> applications where jstacks of specific containers (YARN) or K8s
> executor/coordinator pod processes (CDW) may be necessary. Manual collection
> may be difficult or near-impossible in certain situations, and as such it is a
> valuable diagnostic feature.
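
The difference between the one-shot schedule call and the repeating variants named in the issue description can be illustrated with a minimal, standalone sketch. This is not the committed TEZ-4646 patch and does not use any Tez code: the class PeriodicDumpSketch, the method countRuns, and the counter that stands in for a thread dump are all hypothetical, and the intervals are shortened from seconds to milliseconds so the example finishes quickly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not taken from TezThreadDumpHelper.
class PeriodicDumpSketch {

    /**
     * Schedules a stand-in "dump" task either once (schedule) or repeatedly
     * (scheduleWithFixedDelay) and returns how many times it ran within windowMs.
     */
    static int countRuns(boolean periodic, long intervalMs, long windowMs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        Runnable dump = runs::incrementAndGet; // stand-in for capturing a jstack

        try {
            if (periodic) {
                // Runs after intervalMs, then again intervalMs after each run completes.
                executor.scheduleWithFixedDelay(dump, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
            } else {
                // Runs exactly once after intervalMs; the delay is not a repeat period.
                executor.schedule(dump, intervalMs, TimeUnit.MILLISECONDS);
            }
            Thread.sleep(windowMs); // plays the role of the long-running task
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("schedule runs: " + countRuns(false, 200, 1000));
        System.out.println("scheduleWithFixedDelay runs: " + countRuns(true, 200, 1000));
    }
}
```

With schedule the task runs once regardless of how long the window is, which matches the single .jstack file in the logs above. With scheduleWithFixedDelay the count grows with the window; the exact number depends on timing. scheduleAtFixedRate differs only in measuring the period from the start of each run rather than from its end, so fixed delay is the safer choice if a single dump can take longer than the interval.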