zuozhiw opened a new issue, #4070:
URL: https://github.com/apache/texera/issues/4070

   ### Feature Summary
   
   Texera currently lacks systematic observability instrumentation, which makes 
it difficult to monitor and debug live services and distributed workflows. This 
feature request proposes an observability solution built on OpenTelemetry 
standards to enable centralized logging, metrics collection, and distributed 
tracing across backend services. The resulting data can then be fed into 
open-source observability tools.
   
   
   ### Proposed Solution or Design
   
   ### Current Observability Gaps
   
   **Logging**:
   - Logback (Scala) and loguru (Python) with file/console output only
   - Logs are ephemeral in Kubernetes (lost when pods restart)
   - Cannot correlate logs across services for a single workflow execution
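
To illustrate the correlation gap, here is a minimal sketch of stamping every log line with a per-execution identifier. It uses only the Python standard library `logging` module rather than loguru or the OpenTelemetry SDK, and the names `execution_id`, `ExecutionIdFilter`, `build_logger`, and the ID `wf-123` are hypothetical. With OpenTelemetry in place, the active trace ID would likely play the role of this identifier.

```python
import contextvars
import io
import logging

# Hypothetical context variable holding the ID of the workflow execution
# currently being processed; "-" means no execution is active.
execution_id = contextvars.ContextVar("execution_id", default="-")


class ExecutionIdFilter(logging.Filter):
    """Copy the current execution ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.execution_id = execution_id.get()
        return True


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger whose output lines carry the execution ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [exec=%(execution_id)s] %(message)s")
    )
    handler.addFilter(ExecutionIdFilter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger("texera.sketch", buffer)

# Everything logged while the ID is set can later be grouped by that ID.
token = execution_id.set("wf-123")
log.info("operator started")
log.info("operator finished")
execution_id.reset(token)
log.info("idle")

print(buffer.getvalue(), end="")
```

Once each service emits the same identifier, a central log store can group lines from different services by that single field.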
   
   **Metrics**:
   - No application-level metrics (request rates, error rates, latency, 
database query times)
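
As a rough picture of what application-level metrics would capture, the sketch below counts requests and errors and records latency for one function. It is a standard-library stand-in, not the OpenTelemetry metrics API; the decorator `instrumented`, the in-memory dictionaries, and the `list_workflows` handler are all hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-ins for a metrics backend (hypothetical names).
request_count = defaultdict(int)
error_count = defaultdict(int)
latency_seconds = defaultdict(list)


def instrumented(name):
    """Record call count, error count, and latency under the given name."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            request_count[name] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                error_count[name] += 1
                raise
            finally:
                # Latency is recorded for successful and failed calls alike.
                latency_seconds[name].append(time.perf_counter() - start)

        return wrapper

    return decorator


@instrumented("list_workflows")
def list_workflows(fail=False):
    """Hypothetical request handler used only for this illustration."""
    if fail:
        raise RuntimeError("database unavailable")
    return ["wf-1", "wf-2"]


list_workflows()
try:
    list_workflows(fail=True)
except RuntimeError:
    pass

error_rate = error_count["list_workflows"] / request_count["list_workflows"]
print(request_count["list_workflows"], error_count["list_workflows"], error_rate)
```

In a real deployment these values would be exported rather than held in process memory, so they survive restarts and can be aggregated across pods.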
   
   **Tracing**:
   - No distributed tracing implementation
   - Cannot trace a workflow execution across multiple services, Python 
workers, database queries, or external API calls
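
Distributed tracing depends on every hop passing along a shared trace identifier. OpenTelemetry's default propagation uses the W3C Trace Context `traceparent` header, whose version-00 layout is `00-<trace-id>-<parent-id>-<flags>`. The sketch below builds and forwards such a header by hand using only the standard library; the helper names `new_traceparent` and `child_traceparent` are hypothetical, and in practice the OpenTelemetry SDK would handle this automatically.

```python
import re
import secrets

# Version 00 traceparent: 32 hex chars of trace ID, 16 of span ID, 2 of flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace with a random trace ID and span ID, marked sampled."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for this hop."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        # Malformed or missing header: start a fresh trace instead.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent()               # e.g. created by the first service
downstream = child_traceparent(root)   # e.g. forwarded to a Python worker

# Both hops share one trace ID, which is what lets a backend stitch them together.
print(root.split("-")[1] == downstream.split("-")[1])
```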
   
   **Health Checks**:
   - Basic `/api/healthcheck` endpoints return `{"status": "ok"}` only
   - The endpoints do not verify that the service is actually healthy and 
report no detailed status
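
One possible shape for a more detailed health response is sketched below: each dependency gets its own check, and the overall status is derived from the results. The function `run_health_checks`, the component names, the `degraded` status value, and the two placeholder checks are assumptions for illustration, not part of Texera.

```python
import json
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run each named check; a check signals failure by raising an exception."""
    components = {}
    for name, check in checks.items():
        try:
            check()
            components[name] = {"status": "ok"}
        except Exception as exc:
            components[name] = {"status": "error", "detail": str(exc)}
    healthy = all(c["status"] == "ok" for c in components.values())
    return {"status": "ok" if healthy else "degraded", "components": components}


def check_database() -> None:
    """Placeholder: a real check might run a trivial query."""


def check_storage() -> None:
    """Placeholder that simulates an unreachable dependency."""
    raise ConnectionError("storage endpoint unreachable")


report = run_health_checks({"database": check_database, "storage": check_storage})
print(json.dumps(report, indent=2))
```

A response of this kind lets an operator see which dependency failed without reading logs.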
   
   
   
   ## Proposed Solution
   
   ### High-Level Approach
   
   Add **OpenTelemetry instrumentation** throughout the codebase to emit logs, 
metrics, and traces in a standardized format. These signals can then be 
collected and exported to various open-source observability tools.
   
   ### Implementation Strategy
   
   **Instrumentation Layer**:
   - Add OpenTelemetry SDK to all services (Scala/Java and Python)
   - Add auto-instrumentation (no code changes) where possible (HTTP, JDBC, 
Akka)
   - Migrate current logging to use OpenTelemetry
   - Add manual instrumentation for metrics and traces where specific use 
cases call for it
   
   **Collection Layer**:
   - Deploy the OpenTelemetry Collector (as a DaemonSet in Kubernetes) to 
collect logs, metrics, and traces
   - The Collector can export to various backends (configurable, not hardcoded)
   
   **Observability Backends**:
   - The standardized OpenTelemetry data can be integrated with open-source 
tools such as Grafana and Elastic.
   
   
   
   ### Impact / Priority
   
   (P2) Medium – useful enhancement
   
   ### Affected Area
   
   Deployment / Infrastructure

