[ 
https://issues.apache.org/jira/browse/GOBBLIN-1822?focusedWorklogId=858819&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-858819
 ]

ASF GitHub Bot logged work on GOBBLIN-1822:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Apr/23 01:35
            Start Date: 25/Apr/23 01:35
    Worklog Time Spent: 10m 
      Work Description: homatthew commented on code in PR #3685:
URL: https://github.com/apache/gobblin/pull/3685#discussion_r1175928075


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java:
##########
@@ -222,8 +223,22 @@ void runInternal() {
           if (jobContext != null) {
             log.debug("JobContext {} num partitions {}", jobContext, jobContext.getPartitionSet().size());
 
-            inUseInstances.addAll(jobContext.getPartitionSet().stream().map(jobContext::getAssignedParticipant)
-                .filter(Objects::nonNull).collect(Collectors.toSet()));
+            inUseInstances.addAll(jobContext.getPartitionSet().stream().map(i -> {
+              if(jobContext.getPartitionState(i) == null) {
+                return jobContext.getAssignedParticipant(i);
+              }
+              if (!jobContext.getPartitionState(i).equals(
+                  TaskPartitionState.ERROR) && !jobContext.getPartitionState(i).equals(

Review Comment:
   This change makes sense if Helix always assigns the task to a new 
participant on the next run. I still have some concerns about visibility when 
Helix doesn't reassign the task.
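
   To illustrate the visibility concern, here is a minimal, self-contained 
sketch. It is not the PR's code: the `State` enum and `Partition` class are 
hypothetical stand-ins for Helix's `TaskPartitionState` and the per-partition 
data read from `JobContext`, and the set of states treated as abnormal is a 
parameter because the hunk above is cut off before the second state. The idea 
is to record every partition that is excluded from the in-use set, so that a 
task Helix never reassigns still leaves a trace.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class InUseInstancesSketch {
  // Hypothetical stand-in for Helix's TaskPartitionState (subset only).
  enum State { INIT, RUNNING, COMPLETED, ERROR }

  // Hypothetical holder for what the real code reads from JobContext.
  static final class Partition {
    final State state;          // may be null if no state is recorded yet
    final String participant;   // may be null if nothing is assigned

    Partition(State state, String participant) {
      this.state = state;
      this.participant = participant;
    }
  }

  /**
   * Collects participants that count as in use. Partitions whose state is in
   * {@code abnormal} are skipped, and a message describing each skipped
   * partition is appended to {@code warnings} so the skip is visible.
   */
  static Set<String> inUseInstances(Map<Integer, Partition> partitions,
                                    Set<State> abnormal,
                                    List<String> warnings) {
    Set<String> inUse = new HashSet<>();
    for (Map.Entry<Integer, Partition> entry : partitions.entrySet()) {
      Partition p = entry.getValue();
      if (p.participant == null) {
        continue; // nothing assigned, so nothing to count
      }
      if (p.state != null && abnormal.contains(p.state)) {
        warnings.add(String.format("partition %d on %s is in state %s",
            entry.getKey(), p.participant, p.state));
        continue; // excluded from the in-use set, but logged
      }
      // A null state is treated as in use, mirroring the null check in the diff.
      inUse.add(p.participant);
    }
    return inUse;
  }

  public static void main(String[] args) {
    Map<Integer, Partition> partitions = new HashMap<>();
    partitions.put(0, new Partition(State.RUNNING, "instance-a"));
    partitions.put(1, new Partition(State.ERROR, "instance-b"));
    partitions.put(2, new Partition(null, "instance-c"));

    List<String> warnings = new ArrayList<>();
    Set<String> inUse = inUseInstances(partitions, EnumSet.of(State.ERROR), warnings);
    System.out.println(inUse.size() + " in use, " + warnings.size() + " warning(s)");
  }
}
```

   In the real manager the warnings would presumably go through the existing 
logger rather than a list; the list is used here only to keep the sketch 
free of dependencies.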





Issue Time Tracking
-------------------

    Worklog Id:     (was: 858819)
    Time Spent: 50m  (was: 40m)

> Logging Abnormal Helix Task States
> ----------------------------------
>
>                 Key: GOBBLIN-1822
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1822
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: Zihan Li
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, in the autoScalingManager, we iterate through all Helix tasks 
> without logging their statuses. This means that if any issues occur and we 
> need to restart the pipeline, we lose the Helix status information, making it 
> difficult to investigate the problem further.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
