JNSimba opened a new pull request, #62238:
URL: https://github.com/apache/doris/pull/62238
## Summary
- Add a new `RETRYING` status to the `JobStatus` enum for streaming jobs, so
users can distinguish healthy running jobs from jobs that are hitting errors
and auto-retrying.
- Previously, user-initiated pauses and recoverable errors both mapped to the
same `PAUSED` status, so they could not be told apart in
`show streaming jobs`.
- Now `PAUSED` is used only for user-initiated pauses and unrecoverable
errors, while `RETRYING` indicates that the job is auto-recovering with
exponential backoff.
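
To make the status split concrete, here is a minimal sketch of how the enum might look after this change. It is not the actual Doris source: the member list, the exact body of `isRunning()`, and the `needsUserAction()` helper are assumptions for illustration only.

```java
// Hypothetical sketch; the real JobStatus enum in Doris may have other members and methods.
enum JobStatus {
    PENDING, RUNNING, RETRYING, PAUSED, STOPPED, FINISHED;

    // Assumption: RETRYING is treated as "running" because the job is still
    // expected to recover on its own, without user action.
    boolean isRunning() {
        return this == RUNNING || this == RETRYING;
    }

    // Hypothetical helper: PAUSED now always means the user has to act
    // (either they paused the job, or the error is unrecoverable).
    boolean needsUserAction() {
        return this == PAUSED;
    }
}
```

With this split, a monitoring script can treat `RETRYING` as a transient condition and alert only on `PAUSED`.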
### State transition diagram
```
                     CREATE JOB
                          |
                          v
                     +---------+
                     | PENDING |
                     +----+----+
                          |  createStreamingTask(), autoResumeCount = 0
                          v
                     +---------+
        +------------| RUNNING |------------------+
        |            +--+----+-+                  |
        |               |    |                    |
   user PAUSE     task fail/ |              hasReachedEnd
 (MANUAL_PAUSE)   meta fail/ |                    |
        |         sched fail |                    v
        |      (recoverable) |               +----------+
        |               |    data quality    | FINISHED |
        |               |        error       +----------+
        |               |   (unrecoverable)
        v               v          |
    +--------+     +----------+    |
    | PAUSED |     | RETRYING |    |
    +---+----+     +----+-----+    |
        |               |          v
   user RESUME      backoff,   +--------+
        |           recreate   | PAUSED |
        v             task     +--------+
   +---------+          |
   | PENDING |     task result
   +----+----+       /     \
        |       success    fail
        v          |        |
     RUNNING       v        v
                RUNNING  RETRYING
               (count=0) (keep, count++)

RETRYING --> user PAUSE --> PAUSED
any non-final --> user STOP --> STOPPED
```
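
The transitions in the diagram can also be written down as a lookup table. The sketch below is only an illustration of that idea, not the validation logic in `AbstractJob.java`: the class name `TransitionSketch`, the nested `State` enum, and `canTransition` are all hypothetical, and the table contains only the edges drawn in the diagram.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the diagram as an allowed-transition table.
final class TransitionSketch {
    enum State { PENDING, RUNNING, RETRYING, PAUSED, STOPPED, FINISHED }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);

    static {
        // STOPPED is reachable from every non-final state via user STOP.
        ALLOWED.put(State.PENDING, EnumSet.of(State.RUNNING, State.STOPPED));
        ALLOWED.put(State.RUNNING,
                EnumSet.of(State.PAUSED, State.RETRYING, State.FINISHED, State.STOPPED));
        // RETRYING -> RETRYING models a failed retry (state kept, count++).
        // RETRYING -> PENDING is absent: RESUME is rejected on RETRYING jobs.
        ALLOWED.put(State.RETRYING,
                EnumSet.of(State.RUNNING, State.RETRYING, State.PAUSED, State.STOPPED));
        ALLOWED.put(State.PAUSED, EnumSet.of(State.PENDING, State.STOPPED));
        // Final states have no outgoing transitions.
        ALLOWED.put(State.STOPPED, EnumSet.noneOf(State.class));
        ALLOWED.put(State.FINISHED, EnumSet.noneOf(State.class));
    }

    static boolean canTransition(State from, State to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

For example, `TransitionSketch.canTransition(State.RETRYING, State.PENDING)` is `false` in this table, which mirrors the rule that `RESUME` is rejected while a job is retrying.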
### Key changes
| File | Change |
|------|--------|
| `JobStatus.java` | Add `RETRYING` enum, include in `isRunning()` |
| `AbstractJob.java` | Allow RETRYING state transitions |
| `StreamingJobSchedulerTask.java` | New `handleRetryingState()` with backoff + task recreation; PAUSED does nothing |
| `StreamingInsertJob.java` | `onStreamTaskFail`/`fetchMeta` → RETRYING; `onStreamTaskSuccess` → RUNNING + reset count; `gsonPostProcess` null fallback for downgrade safety |
| `ResumeJobCommand.java` | Reject RESUME on RETRYING jobs |
| `StreamingTaskScheduler.java` | Schedule failure → RETRYING |
| `AbstractJobStatusTest.java` | Add RETRYING transition tests |
| Regression tests (5 files) | Update expected status from PAUSED to RETRYING |
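
As a rough illustration of the backoff and counter handling described for `handleRetryingState()`, `onStreamTaskFail`, and `onStreamTaskSuccess`, the sketch below doubles the delay on each consecutive failure and resets it on success. The class name `RetryBackoff`, the base and cap values, and the doubling formula are assumptions; the PR text only states that the retry uses exponential backoff and that the count is reset on success.

```java
// Hypothetical sketch of exponential backoff driven by a retry counter.
final class RetryBackoff {
    private final long baseDelayMs;
    private final long maxDelayMs;
    private int autoResumeCount = 0;

    RetryBackoff(long baseDelayMs, long maxDelayMs) {
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Delay to wait before recreating the task: base * 2^count, capped at max.
    long currentDelayMs() {
        long delay = baseDelayMs;
        for (int i = 0; i < autoResumeCount && delay < maxDelayMs; i++) {
            // Guard against overflow before doubling.
            delay = (delay > maxDelayMs / 2) ? maxDelayMs : delay * 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    // Task failed again: stay in RETRYING and lengthen the next wait.
    void onTaskFail() {
        autoResumeCount++;
    }

    // Task succeeded: back to RUNNING with a fresh counter.
    void onTaskSuccess() {
        autoResumeCount = 0;
    }

    int getAutoResumeCount() {
        return autoResumeCount;
    }
}
```

With a base of 1000 ms and a cap of 8000 ms, consecutive failures in this sketch wait 1000, 2000, 4000, and then 8000 ms, and a single success returns the delay to 1000 ms.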
## Test plan
- [ ] UT: `AbstractJobStatusTest` covers all RETRYING state transitions
- [ ] Regression: `test_streaming_insert_job_alter_aksk` — alter to wrong
credentials, verify RETRYING status
- [ ] Regression: `test_streaming_insert_job_fetch_meta_error` — debug point
fetch meta failure, verify RETRYING
- [ ] Regression: `test_streaming_job_schedule_task_error` — debug point
schedule failure, verify RETRYING
- [ ] Regression: `test_streaming_insert_job_task_retry` — task timeout,
verify RETRYING
- [ ] Regression: CDC
`test_streaming_job_cdc_stream_postgres_latest_alter_cred` — wrong PG
credentials, verify RETRYING
🤖 Generated with [Claude Code](https://claude.com/claude-code)