zuozhiw opened a new issue, #4070:
URL: https://github.com/apache/texera/issues/4070
### Feature Summary
Texera currently lacks systematic observability instrumentation, which makes
it difficult to monitor and debug live services and distributed workflows.
This feature request proposes an observability solution based on OpenTelemetry
standards to enable centralized logging, metrics collection, and distributed
tracing across backend services. The resulting data can then be integrated
with open-source observability tools.
### Proposed Solution or Design
### Current Observability Gaps
**Logging**:
- Logback (Scala) and loguru (Python) with file/console output only
- Logs are ephemeral in Kubernetes (lost when pods restart)
- Cannot correlate logs across services for a single workflow execution
**Metrics**:
- No application-level metrics (request rates, error rates, latency,
database query times)
**Tracing**:
- No distributed tracing implementation
- Cannot trace a workflow execution across multiple services, Python
workers, database queries, or external API calls
**Health Checks**:
- Basic `/api/healthcheck` endpoints return `{"status": "ok"}` only
- No checks of actual service or dependency health, and no detailed status
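As an illustration of what a more detailed health check could report, the
sketch below runs a set of named dependency checks and aggregates them into one
response. The check names (`database`, `storage`), the `degraded` status value,
and the response shape are assumptions made for this example, not part of the
existing `/api/healthcheck` endpoints.

```python
import json
import time
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run each named check; a check passes if it returns without raising."""
    results = {}
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
            status, error = "ok", None
        except Exception as exc:  # report any failure instead of crashing
            status, error = "fail", str(exc)
        results[name] = {
            "status": status,
            "latency_ms": round((time.perf_counter() - started) * 1000, 2),
            "error": error,
        }
    all_ok = all(r["status"] == "ok" for r in results.values())
    return {"status": "ok" if all_ok else "degraded", "checks": results}


# Hypothetical checks; real ones would probe actual dependencies.
def check_database() -> None:
    pass  # e.g. run a trivial query against the database


def check_storage() -> None:
    raise ConnectionError("storage endpoint unreachable")


if __name__ == "__main__":
    report = run_health_checks({"database": check_database, "storage": check_storage})
    print(json.dumps(report, indent=2))
```

With the two hypothetical checks above, the overall status is `degraded` and
the `storage` entry carries the error message, which is the kind of detail the
current `{"status": "ok"}` response cannot express.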
## Proposed Solution
### High-Level Approach
Add **OpenTelemetry instrumentation** throughout the codebase to emit logs,
metrics, and traces in a standardized format. These signals can then be
collected and exported to various open-source observability tools.
### Implementation Strategy
**Instrumentation Layer**:
- Add OpenTelemetry SDK to all services (Scala/Java and Python)
- Add auto-instrumentation (no code changes) where possible (HTTP, JDBC,
Akka)
- Migrate current logging to use OpenTelemetry
- Add manual instrumentation for metrics and traces where specific use cases
need it
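The log-correlation goal behind this layer can be sketched as follows: every
log record is stamped with the ID of the workflow execution it belongs to and
emitted as structured JSON, so a backend can filter on that field. This is only
an illustration using Python's standard `logging` and `contextvars` modules,
not loguru or the OpenTelemetry SDK; `execution_id_var`, `ExecutionIdFilter`,
`JsonFormatter`, and `build_logger` are hypothetical names.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable holding the current workflow execution ID.
execution_id_var = contextvars.ContextVar("execution_id", default="-")


class ExecutionIdFilter(logging.Filter):
    """Attach the current execution ID to every record passing the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.execution_id = execution_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "execution_id": record.execution_id,
            "message": record.getMessage(),
        })


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes correlated JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(ExecutionIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger("texera.sketch", buffer)
    token = execution_id_var.set("exec-42")  # set when an execution starts
    log.info("operator started")
    execution_id_var.reset(token)            # cleared when it finishes
    print(buffer.getvalue(), end="")
```

Using a context variable keeps call sites unchanged: code that logs does not
need to pass the execution ID around explicitly.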
**Collection Layer**:
- Deploy the OpenTelemetry Collector (as a DaemonSet in Kubernetes) to collect
logs, metrics, and traces
- Collector can export to various backends (configurable, not hardcoded)
**Observability Backends**:
- The standardized OpenTelemetry data can be integrated with open-source
tools such as Grafana and Elastic.
### Impact / Priority
(P2) Medium – useful enhancement
### Affected Area
Deployment / Infrastructure