[jira] [Work logged] (BEAM-10291) Lull detection log to include full thread dump

ASF GitHub Bot (Jira) Tue, 07 Jul 2020 12:10:10 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-10291?focusedWorklogId=455660&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-455660
 ]


ASF GitHub Bot logged work on BEAM-10291:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Jul/20 19:09
            Start Date: 07/Jul/20 19:09
    Worklog Time Spent: 10m 
      Work Description: pabloem commented on a change in pull request #12143:
URL: https://github.com/apache/beam/pull/12143#discussion_r451078998



##########
File path: 
runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/DataflowOperationContext.java
##########
@@ -194,6 +200,18 @@ public DataflowExecutionState(
       this.metricsContainer = metricsContainer;
     }
 
+    public DataflowExecutionState(
+        NameContext nameContext,
+        String stateName,
+        @Nullable String requestingStepName,
+        @Nullable Integer inputIndex,
+        @Nullable MetricsContainer metricsContainer,
+        ProfileScope profileScope,
+        Clock clock) {
+      this(nameContext, stateName, requestingStepName, inputIndex, 
metricsContainer, profileScope);
+      this.clock = clock;
+    }

Review comment:
       Instead, add Clock to the constructor above (lines 188-200), make the 
clock member final, and pass Clock.SYSTEM in the constructor without it?

##########
File path: 
runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/DataflowOperationContext.java
##########
@@ -264,7 +276,49 @@ public void reportLull(Thread trackedThread, long millis) {
       logRecord.setLoggerName(DataflowOperationContext.LOG.getName());
 
       // Publish directly in the context of this specific ExecutionState.
-      DataflowWorkerLoggingInitializer.getLoggingHandler().publish(this, 
logRecord);
+      DataflowWorkerLoggingHandler dataflowLoggingHandler =
+          DataflowWorkerLoggingInitializer.getLoggingHandler();
+      dataflowLoggingHandler.publish(this, logRecord);
+
+      if (shouldLogFullThreadDump()) {
+        Map<Thread, StackTraceElement[]> threadSet = 
Thread.getAllStackTraces();
+        for (Map.Entry<Thread, StackTraceElement[]> entry : 
threadSet.entrySet()) {
+          Thread thread = entry.getKey();
+          StackTraceElement[] stackTrace = entry.getValue();
+          StringBuilder message = new StringBuilder();
+          message.append(thread.toString()).append(":\n");
+          message.append(getStackTraceForLullMessage(stackTrace));
+          logRecord = new LogRecord(Level.INFO, message.toString());
+          logRecord.setLoggerName(DataflowOperationContext.LOG.getName());
+          dataflowLoggingHandler.publish(this, logRecord);
+        }
+      }
+    }
+
+    // A full thread dump is performed at most once every 20 minutes.
+    private static final long LOG_LULL_FULL_THREAD_DUMP_MS = 20 * 60 * 1000;
+
+    // Last time when a full thread dump was performed.
+    private long lastFullThreadDumpMillis = 0;
+
+    private boolean shouldLogFullThreadDump() {

Review comment:
       This means that a full thread dump will happen the very first time a 
lull is logged. Lulls are not that unusual, right? We want full thread dumps 
for pipelines that are truly stuck, not for pipelines that are experiencing a 
slow stage.
   
   Perhaps we should use the `millis` argument from `reportLull`, and only 
report when millis passes 20 minutes?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 455660)
    Time Spent: 4.5h  (was: 4h 20m)

> Lull detection log to include full thread dump
> ----------------------------------------------
>
>                 Key: BEAM-10291
>                 URL: https://issues.apache.org/jira/browse/BEAM-10291
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-dataflow
>            Reporter: David Yan
>            Assignee: David Yan
>            Priority: P2
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> What we have today is a thread dump of the thread that's stuck, but in many 
> cases (most notably BQ) I/O happens in a separate thread that is not included 
> in the dump. Ideally, we'd need to have a full thread dump of the entire 
> process.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (BEAM-10291) Lull detection log to include full thread dump

Reply via email to