pratapaditya04 opened a new pull request, #4197:
URL: https://github.com/apache/gobblin/pull/4197
### JIRA
- [x] My PR addresses the following Gobblin JIRA issue and references it in
the PR title.
- GOBBLIN-XXXX (follow-up to the AM shutdown-hook GTE work in commit
`dd6b26b70`)
### Description
Today GaaS relies on the GobblinTrackingEvent (GTE) stream as the source of
truth for job outcomes, but two gaps make the signal unreliable for the
temporal-on-yarn flow:
1. **Missing terminal GTEs on AM failure paths.** The previous commit on
this branch (`dd6b26b70` — Emit job-completion GTE from AM shutdown hook as
single source of truth) installed a JVM-shutdown-hook GTE emitter inside ` +
"`GobblinTemporalJobLauncher.emitJobCompletionGTE()`" + `. That covers graceful
AM shutdown. It does NOT cover JVM-fatal paths — OOM, SIGKILL, container-kill
before the hook registers, AM CLI parse error — where the AM JVM dies before
the hook can run. From GaaS's perspective the job is stuck.
2. **Wrong exit status.** Both ` +
"`GobblinTemporalApplicationMaster.main()`" + ` and ` +
"`GobblinYarnAppLauncher.main()`" + ` exited 0 on essentially every code path.
The AM only ` + "`System.exit(1)`" + `s on CLI parse failure; on actual job
failure it returned cleanly through the try-with-resources and the JVM exited
0. The launcher had no ` + "`System.exit`" + ` at all. This masked real
failures from Grid Gateway dashboards.
This PR closes both gaps:
**AM-side exit code** (` + "`gobblin-temporal`" + `)
- ` + "`GobblinTemporalJobLauncher`" + ` caches the terminal ` +
"`WorkflowExecutionStatus`" + ` into a static field, populated either by ` +
"`handleLaunchFinalization`" + ` (normal flow, while Temporal stubs are alive)
or by ` + "`emitJobCompletionGTE`" + ` (JVM shutdown hook).
- Exposes ` + "`getLastTerminalStatus()`" + ` and static ` +
"`computeExitCode(WorkflowExecutionStatus)`" + ` so the AM main can read the
cache after ` + "`close()`" + ` has shut down Temporal stubs.
- ` + "`GobblinTemporalApplicationMaster.main()`" + ` reads the cache after
` + "`start()`" + ` returns and calls ` + "`System.exit(0|1)`" + `, surfacing
job failures as non-zero AM JVM exit codes.
**GGW launcher** (` + "`gobblin-yarn/GobblinYarnAppLauncher`" + `)
- ` + "`handleApplicationReportArrivalEvent`" + ` now calls a new ` +
"`handleTerminalAppStatus`" + `, which maps ` + "`FinalApplicationStatus`" + `
→ ` + "`JOB_FAILED`" + ` / ` + "`JOB_CANCEL`" + ` (SUCCEEDED → no synthetic
GTE, since the AM hook covers it), emits with the same event names and metadata
schema as the AM-side hook so ` + "`KafkaAvroJobStatusMonitor`" + ` consumes
both uniformly, and sets the launcher's exit code to 1.
- ` + "`handleGetApplicationReportFailureEvent`" + ` exhaustion → ` +
"`JOB_FAILED`" + ` with ` + "`failureReason = GGW_LOST_AM_VISIBILITY`" + `
(distinct from ` + "`GGW_OBSERVED_AM_TERMINATION`" + `) so triage can tell the
two apart, plus ` + "`exitCode = 1`" + `.
- ` + "`main()`" + ` now ends with ` +
"`System.exit(launcher.getExitCode())`" + `.
- Emission is idempotency-guarded by an ` + "`AtomicBoolean`" + ` to handle
EventBus replays and the race between the two handlers.
Scope: temporal flow only. Shared schema is reused (` +
"`TimingEvent.LauncherTimings.{JOB_SUCCEEDED, JOB_FAILED, JOB_CANCEL}`" + `, `
+ "`TimingEvent.FlowEventConstants.{FLOW_GROUP_FIELD, FLOW_NAME_FIELD,
FLOW_EXECUTION_ID_FIELD, JOB_NAME_FIELD, JOB_GROUP_FIELD}`" + `, ` +
"`ConfigurationKeys.{FLOW_GROUP_KEY, ...}`" + `).
### Tests
- [x] My PR adds the following unit tests.
**` + "`GobblinTemporalJobLauncherTest`" + `** (7 new cases):
- ` + "`testComputeExitCodeForCompletedReturnsZero`" + ` / ` +
"`testComputeExitCodeForAnyNonCompletedReturnsOne`" + ` cover the AM main's
static exit-code helper for every ` + "`WorkflowExecutionStatus`" + `.
- ` + "`testHandleLaunchFinalizationPopulatesLastTerminalStatus`" + `
verifies normal-flow status capture.
- ` + "`testLastTerminalStatusSurvivesClose`" + ` ensures the cache outlives
` + "`close()`" + ` shutting down Temporal stubs (this is what the AM main
relies on).
- ` + "`testEmitJobCompletionGTEDoesNotReQueryWhenStatusAlreadyCached`" + `
verifies the hook reuses the cached value rather than re-querying Temporal
during shutdown.
- ` + "`testConstructorResetsStaleTerminalStatus`" + ` guards against
cross-test cache leakage (matters for any JVM that reuses launchers).
**` + "`GobblinYarnAppLauncherTerminalGteTest`" + `** (new file, 16 cases):
focused unit tests that avoid the heavy ` + "`MiniYARNCluster`" + ` + Helix
integration setup used by ` + "`GobblinYarnAppLauncherTest`" + `, by injecting
fields with reflection on a CALLS_REAL_METHODS Mockito stand-in. Covers ` +
"`mapFinalAppStatusToEventName`" + ` for every ` + "`FinalApplicationStatus`" +
`, ` + "`handleTerminalAppStatus`" + ` exit-code wiring + idempotency, ` +
"`handleLostAmVisibility`" + ` exit-code + idempotency + interaction with
already-emitted GTEs, and ` + "`buildLauncherTerminalMetadata`" + ` (flow-id
population from ` + "`Config`" + `, diagnostics truncation, lost-visibility
variant).
All 23 ` + "`gobblin-temporal:test`" + ` cases pass; new ` +
"`GobblinYarnAppLauncherTerminalGteTest`" + ` passes; ` + "`checkstyleMain`" +
` / ` + "`checkstyleTest`" + ` clean on both modules.
### Commits
- [x] My commits all reference JIRA issues in their subject lines.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]