[ https://issues.apache.org/jira/browse/GOBBLIN-1822?focusedWorklogId=858780&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-858780 ]
ASF GitHub Bot logged work on GOBBLIN-1822:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 24/Apr/23 20:19
Start Date: 24/Apr/23 20:19
Worklog Time Spent: 10m
Work Description: ZihanLi58 commented on code in PR #3685:
URL: https://github.com/apache/gobblin/pull/3685#discussion_r1175743668
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java:
##########
@@ -222,8 +223,22 @@ void runInternal() {
 if (jobContext != null) {
 log.debug("JobContext {} num partitions {}", jobContext, jobContext.getPartitionSet().size());
- inUseInstances.addAll(jobContext.getPartitionSet().stream().map(jobContext::getAssignedParticipant)
- .filter(Objects::nonNull).collect(Collectors.toSet()));
+ inUseInstances.addAll(jobContext.getPartitionSet().stream().map(i -> {
+ if(jobContext.getPartitionState(i) == null) {
+ return jobContext.getAssignedParticipant(i);
+ }
+ if (!jobContext.getPartitionState(i).equals(
+ TaskPartitionState.ERROR) && !jobContext.getPartitionState(i).equals(
Review Comment:
Here are some of my considerations:
1. The null check prevents an NPE when I compare the task state against a
specific value, and the early return in that branch is there specifically to
maintain backward compatibility. I'll refactor the code a bit to make it
cleaner.
2. To keep the logs helpful rather than noisy, I will reduce the amount of
information logged for retriable task states. Even if those instances are
added to the in-use map, they drop out on the next run of the method, because
a retry assigns the tasks to new instances and the old ones are no longer
reported.
3. To address tasks that fail multiple times, I will add a log for tasks with
a high number of attempts. This will be clearer than logging the unusual task
state every time.
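A minimal sketch of the three points above, written without Helix on the classpath so it can run on its own. Everything named here is an assumption for illustration: `PartitionState` is a stand-in for Helix's `TaskPartitionState`, `Partition` is a hypothetical flattened view of what `JobContext` reports per partition, and the attempt threshold is arbitrary. The quoted diff is cut off before the second excluded state, so the excluded states are passed in as a parameter rather than guessed.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

class InUseInstancesSketch {

  // Stand-in for Helix's TaskPartitionState (assumption: only a subset is modeled).
  enum PartitionState { INIT, RUNNING, COMPLETED, ERROR }

  // Hypothetical per-partition view; state and participant may both be null.
  static final class Partition {
    final int id;
    final PartitionState state;
    final String participant;
    final int attempts;

    Partition(int id, PartitionState state, String participant, int attempts) {
      this.id = id;
      this.state = state;
      this.participant = participant;
      this.attempts = attempts;
    }
  }

  // Point 1: a null state is checked first, so no comparison can throw an NPE,
  // and such partitions still count as in use (the pre-change behavior).
  static boolean countsAsInUse(PartitionState state, Set<PartitionState> excluded) {
    return state == null || !excluded.contains(state);
  }

  // Point 2: collect participants of partitions whose state is not excluded.
  // Partitions without an assigned participant are skipped.
  static Set<String> inUseInstances(List<Partition> partitions, Set<PartitionState> excluded) {
    return partitions.stream()
        .filter(p -> countsAsInUse(p.state, excluded))
        .map(p -> p.participant)
        .filter(Objects::nonNull)
        .collect(Collectors.toSet());
  }

  // Point 3: ids of partitions whose attempt count exceeds the threshold;
  // the caller would log these once instead of logging every unusual state.
  static List<Integer> highAttemptPartitions(List<Partition> partitions, int threshold) {
    return partitions.stream()
        .filter(p -> p.attempts > threshold)
        .map(p -> p.id)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Partition> partitions = Arrays.asList(
        new Partition(0, null, "host-a", 1),
        new Partition(1, PartitionState.RUNNING, "host-b", 1),
        new Partition(2, PartitionState.ERROR, "host-c", 5));
    System.out.println("in use: " + inUseInstances(partitions, EnumSet.of(PartitionState.ERROR)));
    System.out.println("high attempts: " + highAttemptPartitions(partitions, 3));
  }
}
```

Passing the excluded states as a set keeps the null handling in one place and makes it cheap to add or drop a state later without touching the stream pipeline.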
Issue Time Tracking
-------------------
Worklog Id: (was: 858780)
Time Spent: 40m (was: 0.5h)
> Logging Abnormal Helix Task States
> ----------------------------------
>
> Key: GOBBLIN-1822
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1822
> Project: Apache Gobblin
> Issue Type: Improvement
> Reporter: Zihan Li
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Currently, in the autoScalingManager, we iterate through all Helix tasks
> without logging their statuses. This means that if any issues occur and we
> need to restart the pipeline, we lose the Helix status information, making it
> difficult to investigate the problem further.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)