[jira] [Work logged] (GOBBLIN-1822) Logging Abnormal Helix Task States

ASF GitHub Bot (Jira) Mon, 24 Apr 2023 13:20:09 -0700


     [ 
https://issues.apache.org/jira/browse/GOBBLIN-1822?focusedWorklogId=858780&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-858780
 ]


ASF GitHub Bot logged work on GOBBLIN-1822:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Apr/23 20:19
            Start Date: 24/Apr/23 20:19
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 commented on code in PR #3685:
URL: https://github.com/apache/gobblin/pull/3685#discussion_r1175743668


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java:
##########
@@ -222,8 +223,22 @@ void runInternal() {
           if (jobContext != null) {
             log.debug("JobContext {} num partitions {}", jobContext, jobContext.getPartitionSet().size());
 
-            inUseInstances.addAll(jobContext.getPartitionSet().stream().map(jobContext::getAssignedParticipant)
-                .filter(Objects::nonNull).collect(Collectors.toSet()));
+            inUseInstances.addAll(jobContext.getPartitionSet().stream().map(i -> {
+              if(jobContext.getPartitionState(i) == null) {
+                return jobContext.getAssignedParticipant(i);
+              }
+              if (!jobContext.getPartitionState(i).equals(
+                  TaskPartitionState.ERROR) && !jobContext.getPartitionState(i).equals(

Review Comment:
   Here are some of my considerations:
   1. The null check prevents an NPE when I compare the task state with a 
specific value, and the return in the null case is there specifically to 
maintain backward compatibility. I'll refactor the code a little to make it 
cleaner.
   2. To keep the logs helpful rather than noisy, I will reduce the amount of 
information logged for retriable task states. Even if those instances are 
added to the in-use collection, a retry assigns the task to a new instance, so 
the old ones are removed automatically on the next run of the method.
   3. To address tasks that fail multiple times, I will add a log for tasks 
with a high number of attempts. This will be clearer than logging the unusual 
task state every time.
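
The three points above could be sketched roughly as follows. This is not the PR's code and does not use the Helix API: `State`, `Partition`, `inUseInstances`, the `excluded` set, and the `maxAttempts` threshold are all hypothetical stand-ins chosen for illustration. The sketch only shows the shape of the logic: a null state falls back to the old behavior, excluded states do not count toward in-use instances, and a warning is produced only for tasks with many attempts.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class InUseInstanceSketch {
  /** Hypothetical stand-in for a task partition state (not Helix's enum). */
  public enum State { RUNNING, COMPLETED, ERROR }

  /** Hypothetical per-partition snapshot; state and participant may be null. */
  public static final class Partition {
    final int id;
    final State state;
    final String participant;
    final int attempts;

    public Partition(int id, State state, String participant, int attempts) {
      this.id = id;
      this.state = state;
      this.participant = participant;
      this.attempts = attempts;
    }
  }

  /**
   * Collects participants that should count as in use.
   * Warning messages are appended to the caller-supplied list
   * (a real implementation would use a logger instead).
   */
  public static Set<String> inUseInstances(List<Partition> partitions,
      Set<State> excluded, int maxAttempts, List<String> warnings) {
    Set<String> inUse = new LinkedHashSet<>();
    for (Partition p : partitions) {
      // Point 3: one warning per task with a high attempt count,
      // rather than a message for every unusual state.
      if (p.attempts > maxAttempts) {
        warnings.add("Partition " + p.id + " has " + p.attempts
            + " attempts, state " + p.state);
      }
      // Point 1: a null state keeps the old behavior (participant counts).
      // Point 2: excluded states are dropped without extra logging.
      boolean counts = p.state == null || !excluded.contains(p.state);
      if (counts && p.participant != null) {
        inUse.add(p.participant);
      }
    }
    return inUse;
  }
}
```

Calling `inUseInstances` with `EnumSet.of(State.ERROR)` as the excluded set would skip participants whose task is in the stand-in `ERROR` state while still counting partitions whose state is not yet known.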





Issue Time Tracking
-------------------

    Worklog Id:     (was: 858780)
    Time Spent: 40m  (was: 0.5h)

> Logging Abnormal Helix Task States
> ----------------------------------
>
>                 Key: GOBBLIN-1822
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1822
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: Zihan Li
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, in the autoScalingManager, we iterate through all Helix tasks 
> without logging their statuses. This means that if any issues occur and we 
> need to restart the pipeline, we lose the Helix status information, making it 
> difficult to investigate the problem further.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
