[
https://issues.apache.org/jira/browse/GOBBLIN-1979?focusedWorklogId=896920&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-896920
]
ASF GitHub Bot logged work on GOBBLIN-1979:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 22/Dec/23 14:48
Start Date: 22/Dec/23 14:48
Worklog Time Spent: 10m
Work Description: phet commented on code in PR #3850:
URL: https://github.com/apache/gobblin/pull/3850#discussion_r1435129018
##########
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/TaskStateCollectorService.java:
##########
@@ -252,22 +255,32 @@ public boolean apply(String input) {
     }
     final Queue<TaskState> taskStateQueue = Queues.newConcurrentLinkedQueue();
+    AtomicLong numStateStoreMissing = new AtomicLong(0L);
+    GrowthMilestoneTracker growthTracker = new GrowthMilestoneTracker();
     try (ParallelRunner stateSerDeRunner = new ParallelRunner(numDeserializerThreads, null)) {
       for (final String taskStateName : taskStateNames) {
         log.debug("Found output task state file " + taskStateName);
         // Deserialize the TaskState and delete the file
         stateSerDeRunner.submitCallable(new Callable<Void>() {
           @Override
           public Void call() throws Exception {
-            TaskState taskState = taskStateStore.getAll(taskStateTableName, taskStateName).get(0);
-            taskStateQueue.add(taskState);
+            List<TaskState> matchingTaskStates = taskStateStore.getAll(taskStateTableName, taskStateName);
+            if (matchingTaskStates.isEmpty()) {
Review Comment:
Correct: this solely addresses the case where the state store returns no task
state but otherwise exits normally. In another sort of failure, a state store
impl might throw, and this consolidation still permits such a failure to pass
through uninterrupted.
Since the state store already gave us the list of task state names on line
244, I'd expect any other such failure to be ephemeral (otherwise it would be
an abject logical bug in the state store). Either way, I've avoided
over-engineering the solution precisely because, as you point out, we'd lose
valuable debugging info by conflating dissimilar errors.
If a future failure scenario arises that gives us a concrete grasp of what
kinds of errors these might be, I'd suggest extending this solution at that
point.
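For illustration, here is a minimal, self-contained sketch of the pattern the
diff introduces: count missing task states, log only at growth milestones, and
let any exception from the state store propagate untouched. The
PowerOfTwoMilestoneTracker below is a hypothetical stand-in for Gobblin's
GrowthMilestoneTracker, whose exact API is not reproduced here; the lookup
function likewise stands in for taskStateStore.getAll.

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

public class MilestoneLoggingSketch {

  /** Hypothetical stand-in for GrowthMilestoneTracker: fires at counts 1, 2, 4, 8, ... */
  static class PowerOfTwoMilestoneTracker {
    private long nextMilestone = 1;

    synchronized boolean isAnotherMilestone(long count) {
      if (count < nextMilestone) {
        return false;
      }
      while (nextMilestone <= count) {
        nextMilestone *= 2;
      }
      return true;
    }
  }

  private final AtomicLong numStateStoreMissing = new AtomicLong(0L);
  private final PowerOfTwoMilestoneTracker growthTracker = new PowerOfTwoMilestoneTracker();

  /**
   * Mirrors the diff's consolidation: an empty result is counted and logged
   * only at milestones; any exception the lookup throws passes through
   * uninterrupted, preserving its debugging info.
   */
  void collectTaskState(String taskStateName, Function<String, List<String>> lookup) {
    List<String> matching = lookup.apply(taskStateName); // may throw; deliberately not caught
    if (matching.isEmpty()) {
      long n = numStateStoreMissing.incrementAndGet();
      if (growthTracker.isAnotherMilestone(n)) {
        System.out.println("Missing task states so far: " + n); // O(log n) lines for n failures
      }
    }
  }

  public static void main(String[] args) {
    MilestoneLoggingSketch sketch = new MilestoneLoggingSketch();
    // Simulate 100k tasks whose state is absent from the state store.
    for (int i = 0; i < 100_000; i++) {
      sketch.collectTaskState("task-" + i, name -> List.of());
    }
    System.out.println("Total missing task states: " + sketch.numStateStoreMissing.get());
  }
}

With 100k missing states this prints 17 milestone lines plus one summary,
rather than 100k individual failure logs.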
Issue Time Tracking
-------------------
Worklog Id: (was: 896920)
Time Spent: 0.5h (was: 20m)
> Pare down TaskStateCollectorService failure logging, to avoid flooding logs
> during widespread failure, e.g. O(1k)+
> ------------------------------------------------------------------------------------------------------------------
>
> Key: GOBBLIN-1979
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1979
> Project: Apache Gobblin
> Issue Type: Bug
> Components: gobblin-core
> Reporter: Kip Kohn
> Assignee: Abhishek Tiwari
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Logging TaskState collector failure at the granularity of every task is
> impractical when tasks number in the 100k's.
> This arose because the dest-side volume enforced its namespace quota, which
> left over 100k WUs failing. So while not an everyday event, this is a normal
> occurrence and therefore deserves graceful handling.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)