pratapaditya04 opened a new pull request, #4197:
URL: https://github.com/apache/gobblin/pull/4197

   ### JIRA
   - [x] My PR addresses the following Gobblin JIRA issue and references it in 
the PR title.
       - GOBBLIN-XXXX (follow-up to the AM shutdown-hook GTE work in commit 
`dd6b26b70`)
   
   ### Description
   
   Today GaaS relies on the GobblinTrackingEvent (GTE) stream as the source of 
truth for job outcomes, but two gaps make the signal unreliable for the 
temporal-on-yarn flow:
   
   1. **Missing terminal GTEs on AM failure paths.** The previous commit on 
this branch (`dd6b26b70` — Emit job-completion GTE from AM shutdown hook as 
single source of truth) installed a JVM-shutdown-hook GTE emitter inside ` + 
"`GobblinTemporalJobLauncher.emitJobCompletionGTE()`" + `. That covers graceful 
AM shutdown. It does NOT cover JVM-fatal paths — OOM, SIGKILL, container-kill 
before the hook registers, AM CLI parse error — where the AM JVM dies before 
the hook can run. From GaaS's perspective the job is stuck.
   2. **Wrong exit status.** Both ` + 
"`GobblinTemporalApplicationMaster.main()`" + ` and ` + 
"`GobblinYarnAppLauncher.main()`" + ` exited 0 on essentially every code path. 
The AM only ` + "`System.exit(1)`" + `s on CLI parse failure; on actual job 
failure it returned cleanly through the try-with-resources and the JVM exited 
0. The launcher had no ` + "`System.exit`" + ` at all. This masked real 
failures from Grid Gateway dashboards.
   
   This PR closes both gaps:
   
   **AM-side exit code** (` + "`gobblin-temporal`" + `)
   - ` + "`GobblinTemporalJobLauncher`" + ` caches the terminal ` + 
"`WorkflowExecutionStatus`" + ` into a static field, populated either by ` + 
"`handleLaunchFinalization`" + ` (normal flow, while Temporal stubs are alive) 
or by ` + "`emitJobCompletionGTE`" + ` (JVM shutdown hook).
   - Exposes ` + "`getLastTerminalStatus()`" + ` and static ` + 
"`computeExitCode(WorkflowExecutionStatus)`" + ` so the AM main can read the 
cache after ` + "`close()`" + ` has shut down Temporal stubs.
   - ` + "`GobblinTemporalApplicationMaster.main()`" + ` reads the cache after 
` + "`start()`" + ` returns and calls ` + "`System.exit(0|1)`" + `, surfacing 
job failures as non-zero AM JVM exit codes.
   
   **GGW launcher** (` + "`gobblin-yarn/GobblinYarnAppLauncher`" + `)
   - ` + "`handleApplicationReportArrivalEvent`" + ` now calls a new ` + 
"`handleTerminalAppStatus`" + `, which maps ` + "`FinalApplicationStatus`" + ` 
→ ` + "`JOB_FAILED`" + ` / ` + "`JOB_CANCEL`" + ` (SUCCEEDED → no synthetic 
GTE, since the AM hook covers it), emits with the same event names and metadata 
schema as the AM-side hook so ` + "`KafkaAvroJobStatusMonitor`" + ` consumes 
both uniformly, and sets the launcher's exit code to 1.
   - ` + "`handleGetApplicationReportFailureEvent`" + ` exhaustion → ` + 
"`JOB_FAILED`" + ` with ` + "`failureReason = GGW_LOST_AM_VISIBILITY`" + ` 
(distinct from ` + "`GGW_OBSERVED_AM_TERMINATION`" + `) so triage can tell the 
two apart, plus ` + "`exitCode = 1`" + `.
   - ` + "`main()`" + ` now ends with ` + 
"`System.exit(launcher.getExitCode())`" + `.
   - Emission is idempotency-guarded by an ` + "`AtomicBoolean`" + ` to handle 
EventBus replays and the race between the two handlers.
   
   Scope: temporal flow only. Shared schema is reused (` + 
"`TimingEvent.LauncherTimings.{JOB_SUCCEEDED, JOB_FAILED, JOB_CANCEL}`" + `, ` 
+ "`TimingEvent.FlowEventConstants.{FLOW_GROUP_FIELD, FLOW_NAME_FIELD, 
FLOW_EXECUTION_ID_FIELD, JOB_NAME_FIELD, JOB_GROUP_FIELD}`" + `, ` + 
"`ConfigurationKeys.{FLOW_GROUP_KEY, ...}`" + `).
   
   ### Tests
   - [x] My PR adds the following unit tests.
   
   **` + "`GobblinTemporalJobLauncherTest`" + `** (7 new cases):
   - ` + "`testComputeExitCodeForCompletedReturnsZero`" + ` / ` + 
"`testComputeExitCodeForAnyNonCompletedReturnsOne`" + ` cover the AM main's 
static exit-code helper for every ` + "`WorkflowExecutionStatus`" + `.
   - ` + "`testHandleLaunchFinalizationPopulatesLastTerminalStatus`" + ` 
verifies normal-flow status capture.
   - ` + "`testLastTerminalStatusSurvivesClose`" + ` ensures the cache outlives 
` + "`close()`" + ` shutting down Temporal stubs (this is what the AM main 
relies on).
   - ` + "`testEmitJobCompletionGTEDoesNotReQueryWhenStatusAlreadyCached`" + ` 
verifies the hook reuses the cached value rather than re-querying Temporal 
during shutdown.
   - ` + "`testConstructorResetsStaleTerminalStatus`" + ` guards against 
cross-test cache leakage (matters for any JVM that reuses launchers).
   
   **` + "`GobblinYarnAppLauncherTerminalGteTest`" + `** (new file, 16 cases): 
focused unit tests that avoid the heavy ` + "`MiniYARNCluster`" + ` + Helix 
integration setup used by ` + "`GobblinYarnAppLauncherTest`" + `, by injecting 
fields with reflection on a CALLS_REAL_METHODS Mockito stand-in. Covers ` + 
"`mapFinalAppStatusToEventName`" + ` for every ` + "`FinalApplicationStatus`" + 
`, ` + "`handleTerminalAppStatus`" + ` exit-code wiring + idempotency, ` + 
"`handleLostAmVisibility`" + ` exit-code + idempotency + interaction with 
already-emitted GTEs, and ` + "`buildLauncherTerminalMetadata`" + ` (flow-id 
population from ` + "`Config`" + `, diagnostics truncation, lost-visibility 
variant).
   
   All 23 ` + "`gobblin-temporal:test`" + ` cases pass; new ` + 
"`GobblinYarnAppLauncherTerminalGteTest`" + ` passes; ` + "`checkstyleMain`" + 
` / ` + "`checkstyleTest`" + ` clean on both modules.
   
   ### Commits
   - [x] My commits all reference JIRA issues in their subject lines.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to