wu-sheng opened a new pull request, #13752:
URL: https://github.com/apache/skywalking/pull/13752
### Fix OTLP traces e2e test instability
The OTLP traces e2e test has been flaky because of infrastructure issues, not
the sampling rate, which is 100% everywhere.
**Root causes identified and fixed:**
1. **No health checks on OTel demo containers** — trigger fired before
services were ready, producing no traces during the retry window.
- Added `healthcheck` with TCP checks (same pattern as base-compose OAP)
- Added `depends_on: condition: service_healthy` for proper startup
ordering
2. **Non-existent service endpoints causing 20-30s timeouts** —
`CURRENCY_SERVICE_ADDR: no.exist:80` and `FEATURE_FLAG_GRPC_SERVICE_ADDR:
no.exist:80` caused DNS resolution failures and gRPC dial timeouts on every
request, making `/api/products` slow or causing it to fail entirely.
- Changed both to `productcatalogservice:3550`, a reachable endpoint that
returns a fast gRPC "unimplemented" error instead of hanging
3. **Tight memory limits** — `productcatalogservice` at 20M and `frontend`
at 200M could OOM under CI load.
- Bumped to 40M and 300M respectively
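
For reference, the three fixes above could look roughly like the compose fragment below. This is a minimal sketch, not the actual diff: the service names, port 3550, and memory values come from the description, while the bash-based TCP probe, the healthcheck timing values, and the placement of both environment variables on `frontend` are assumptions made for illustration.

```yaml
# Sketch only: image names, build settings, and unrelated options are omitted.
services:
  productcatalogservice:
    deploy:
      resources:
        limits:
          memory: 40M   # raised from 20M
    healthcheck:
      # Assumption: the image ships bash; /dev/tcp is a bash feature.
      test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/3550"]
      interval: 5s      # assumed timing values
      timeout: 60s
      retries: 120

  frontend:
    deploy:
      resources:
        limits:
          memory: 300M  # raised from 200M
    environment:
      # Assumption: both variables are shown on frontend for brevity.
      CURRENCY_SERVICE_ADDR: productcatalogservice:3550
      FEATURE_FLAG_GRPC_SERVICE_ADDR: productcatalogservice:3550
    depends_on:
      productcatalogservice:
        condition: service_healthy  # start only after the healthcheck passes
```

With `condition: service_healthy`, Compose waits for the dependency's healthcheck to pass before starting the dependent container, so the trigger does not hit services that are still starting.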
Also adds e2e expectation specification documents (CLAUDE.md and
protocol-specific specs) for AI-assisted e2e test development.
- [ ] Explain briefly why the bug exists and how to fix it.
- The test containers had no health checks, so the e2e trigger started
calling endpoints before services were ready. Combined with DNS timeouts on
non-existent service addresses, each request took 20-30s instead of completing
quickly, starving the test of valid traces within the verify window.
- [ ] Update the [`CHANGES`
log](https://github.com/apache/skywalking/blob/master/docs/en/changes/changes.md).