[ 
https://issues.apache.org/jira/browse/TEZ-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017627#comment-18017627
 ] 

Ayush Saxena commented on TEZ-4646:
-----------------------------------

Committed to master.

Thanks [~abstractdog] for the review!

> Periodic jstack collection for tez (tez.thread.dump.interval) only collects 
> jstacks once.
> -----------------------------------------------------------------------------------------
>
>                 Key: TEZ-4646
>                 URL: https://issues.apache.org/jira/browse/TEZ-4646
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Urmas Tamassy
>            Assignee: Ayush Saxena
>            Priority: Major
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> *Issue description:*
> https://issues.apache.org/jira/browse/TEZ-4344 intends to allow users to 
> configure periodic jstack collection for Tez AM (DAG) and executor (task) 
> containers.
> Unfortunately the current implementation only allows a single jstack 
> collection after an initial delay via the tez.thread.dump.interval 
> configuration.
> The issue seems to be caused by using the ScheduledExecutorService schedule 
> method, which runs a task only once after the given delay. 
> scheduleAtFixedRate or scheduleWithFixedDelay seems more appropriate here.
> [https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#schedule-java.lang.Runnable-long-java.util.concurrent.TimeUnit-]
> [https://github.com/apache/tez/blob/c981a9459c48b5e0c49fb4197173a910dcf7a39a/tez-runtime-internals/src/main/java/org/apache/tez/runtime/TezThreadDumpHelper.java#L94]
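
The difference between the one-shot and the repeating scheduling calls named above can be seen in a small standalone sketch. This is not the Tez code or the committed fix; the class and method names (PeriodicDumpSketch, runsWithSchedule, runsWithFixedRate) are hypothetical, and a counter stands in for the thread dump. Only the java.util.concurrent calls are real JDK API.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; a counter stands in for "take a thread dump".
class PeriodicDumpSketch {

    // Counts how often a task runs when submitted with schedule().
    static int runsWithSchedule(long intervalMs, long windowMs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        Runnable dump = runs::incrementAndGet;
        // schedule() is one-shot: the task fires once after the delay and never again.
        executor.schedule(dump, intervalMs, TimeUnit.MILLISECONDS);
        sleepQuietly(windowMs);
        executor.shutdownNow();
        return runs.get();
    }

    // Counts how often a task runs when submitted with scheduleAtFixedRate().
    static int runsWithFixedRate(long intervalMs, long windowMs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        Runnable dump = runs::incrementAndGet;
        // scheduleAtFixedRate() repeats: first run after the initial delay,
        // then once per period until the executor is shut down.
        executor.scheduleAtFixedRate(dump, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
        sleepQuietly(windowMs);
        executor.shutdownNow();
        return runs.get();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }

    public static void main(String[] args) {
        System.out.println("schedule():            " + runsWithSchedule(100, 550) + " run(s)");
        System.out.println("scheduleAtFixedRate(): " + runsWithFixedRate(100, 550) + " run(s)");
    }
}
```

With schedule() the counter stops at one, which matches the single jstack reported here. scheduleWithFixedDelay takes the same arguments but measures the period from the end of one run to the start of the next, which may be preferable if a dump can take longer than the interval.
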
> *Reproduction* (tested on CDP7.1.9SP1, but based on code it appears to affect 
> all releases):
> {code:java}
> set hive.fetch.task.conversion=none;
> -- set hive.security.authorization.sqlstd.confwhitelist.append to tez\.thread\.dump\.interval ahead of time
> set tez.thread.dump.interval=3s;
> -- also set hive.server2.builtin.udf.blacklist to a dummy value ahead of time to allow reflects
> select java_method("java.lang.Thread","sleep",10000L);{code}
> With a 10s duration we would expect 2-3 jstacks (depending on initial delay), 
> but we only receive 1 after 3 seconds. Log snippets:
> {code:java}
> Container: container_e06_1756713041685_0004_01_000002 on 
> ccycloud-3.tamassyurmas.root.comops.site_8041
> LogAggregationType: AGGREGATED
> ======================================================================================================
> LogType:syslog_attempt_1756713041685_0004_1_00_000000_0
> ...
> 2025-09-01 10:04:42,948 [INFO] [main] |runtime.TezThreadDumpHelper|: Periodic 
> Thread Dump Capture Service Configured to capture Thread Dumps at 3000 ms 
> frequency and at path: 
> /var/log/hadoop-yarn/container/application_1756713041685_0004/container_e06_1756713041685_0004_01_000002{code}
> {code:java}
> LogType:attempt_1756713041685_0004_1_00_000000_0_1756721085950.jstack{code}
> 1756721085950 = Mon Sep 1 10:04:45 UTC 2025, which confirms a stack dump after 
> 3 seconds, but no others are observed for the same task attempt or for the DAG 
> (also only a single dump). Attaching the app logs of the same.
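
The timestamp arithmetic above can be checked with a short sketch. The class and method names (JstackTimestampCheck, toUtc, millisAfterLogLine) are hypothetical, and the comparison against the log line assumes the container log timestamps are in UTC, which the report does not state explicitly.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for checking the epoch suffix of a .jstack file name.
class JstackTimestampCheck {

    // Formats epoch milliseconds as a UTC wall-clock time.
    static String toUtc(long epochMillis) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochMilli(epochMillis));
    }

    // Milliseconds between the "Configured to capture" log line and the dump.
    // Assumption: the log timestamp 2025-09-01 10:04:42,948 is in UTC.
    static long millisAfterLogLine(long dumpEpochMillis) {
        long logEpochMillis = LocalDateTime.of(2025, 9, 1, 10, 4, 42, 948_000_000)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();
        return dumpEpochMillis - logEpochMillis;
    }

    public static void main(String[] args) {
        long dump = 1756721085950L; // suffix of the .jstack log type above
        System.out.println(toUtc(dump));              // prints 2025-09-01 10:04:45
        System.out.println(millisAfterLogLine(dump)); // prints 3002
    }
}
```

Under that assumption the dump lands about 3 seconds after the service was configured, consistent with a single run after the 3000 ms delay.
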
> *Expectation/severity:*
> The feature should allow periodic collection rather than a single 
> collection after an initial delay. Please ensure the feature works as 
> expected. The initial delay might also need to be configurable.
> The feature simplifies the investigation of long-running or stuck tez 
> applications where jstacks of specific containers (Yarn) or K8s 
> executor/coordinator pod processes (CDW) may be necessary. Manual collection 
> may be difficult or near-impossible in certain situations and as such it is a 
> valuable diagnostic feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
