[ https://issues.apache.org/jira/browse/GOBBLIN-1979?focusedWorklogId=896920&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-896920 ]

ASF GitHub Bot logged work on GOBBLIN-1979:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 22/Dec/23 14:48
            Start Date: 22/Dec/23 14:48
    Worklog Time Spent: 10m 
      Work Description: phet commented on code in PR #3850:
URL: https://github.com/apache/gobblin/pull/3850#discussion_r1435129018


##########
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/TaskStateCollectorService.java:
##########
@@ -252,22 +255,32 @@ public boolean apply(String input) {
     }
 
     final Queue<TaskState> taskStateQueue = Queues.newConcurrentLinkedQueue();
+    AtomicLong numStateStoreMissing = new AtomicLong(0L);
+    GrowthMilestoneTracker growthTracker = new GrowthMilestoneTracker();
     try (ParallelRunner stateSerDeRunner = new ParallelRunner(numDeserializerThreads, null)) {
       for (final String taskStateName : taskStateNames) {
         log.debug("Found output task state file " + taskStateName);
         // Deserialize the TaskState and delete the file
         stateSerDeRunner.submitCallable(new Callable<Void>() {
           @Override
           public Void call() throws Exception {
-            TaskState taskState = taskStateStore.getAll(taskStateTableName, taskStateName).get(0);
-            taskStateQueue.add(taskState);
+            List<TaskState> matchingTaskStates = taskStateStore.getAll(taskStateTableName, taskStateName);
+            if (matchingTaskStates.isEmpty()) {

Review Comment:
   correct: this solely addresses cases where the state store does not retrieve the task state but otherwise exits normally.  perhaps in another sort of failure, a state store impl might throw; this consolidation still permits such a failure to pass through uninterrupted.
   
   since the state store already gave us the list of task state names on line 244, I'd expect any other such failure to be ephemeral (else an abject logical bug in the state store).  either way, I've avoided over-engineering the solution, precisely because, as you point out, we'd lose valuable debugging info by conflating dissimilar errors.
   
   if a future failure scenario should arise that gives us a concrete grasp of what kinds of errors these might be, I'd suggest extending this solution at that time.
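
The pattern under discussion can be sketched as follows. This is an illustrative standalone example, not Gobblin's actual code: the names `MissingStateSketch`, `isMilestone`, and `recordIfMissing` are hypothetical, and the doubling-milestone rule is an assumption about how a `GrowthMilestoneTracker`-style helper might pare down log volume. An empty result from the state store is counted and logged only at milestones, while any exception the store throws still propagates to the caller uninterrupted.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Sketch (hypothetical names, assumed milestone rule): count missing task
// states and log only at doubling milestones, so O(100k) missing states
// produce O(log n) log lines instead of one line per task.
public class MissingStateSketch {
  private final AtomicLong numStateStoreMissing = new AtomicLong(0L);

  /** Doubling milestones: 1, 2, 4, 8, ... (i.e. exact powers of two). */
  static boolean isMilestone(long n) {
    return n > 0 && (n & (n - 1)) == 0;
  }

  /**
   * Handles one getAll() result: a non-empty list is the normal path (the
   * caller would enqueue the task state); an empty list is counted, with a
   * log line emitted only at milestones. Returns true if a line was logged.
   * Note: exceptions from the state store are NOT caught here, so any other
   * sort of failure still passes through to the caller.
   */
  boolean recordIfMissing(List<?> matchingTaskStates) {
    if (!matchingTaskStates.isEmpty()) {
      return false;
    }
    long n = numStateStoreMissing.incrementAndGet();
    if (isMilestone(n)) {
      System.err.println("task states missing from state store so far: " + n);
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    MissingStateSketch sketch = new MissingStateSketch();
    int logged = 0;
    for (int i = 0; i < 100_000; i++) {
      if (sketch.recordIfMissing(Collections.emptyList())) {
        logged++;
      }
    }
    // Milestones hit: 2^0 .. 2^16, i.e. 17 log lines for 100,000 misses.
    System.out.println(logged + " log lines for 100000 missing task states");
  }
}
```

With 100,000 misses, only the 17 powers of two up to 65,536 trigger a log line, which keeps a widespread failure from flooding the logs while still recording its scale.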





Issue Time Tracking
-------------------

    Worklog Id:     (was: 896920)
    Time Spent: 0.5h  (was: 20m)

> Pare down TaskStateCollectorService failure logging, to avoid flooding logs 
> during widespread failure, e.g. O(1k)+
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-1979
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1979
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: gobblin-core
>            Reporter: Kip Kohn
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Logging task state collector failure at the granularity of every task is 
> impractical when tasks number in the 100k's.
> This arose because the dest-side volume enforced the namespace quota, which 
> left over 100k WUs failing. So while not an everyday event, this is a normal 
> occurrence and therefore deserves graceful handling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
