davidzollo opened a new pull request, #10434:
URL: https://github.com/apache/seatunnel/pull/10434

   ## Summary
   
   This PR introduces **Stain Trace**, a comprehensive data lineage and 
performance tracking system for SeaTunnel. It enables end-to-end tracing of 
data records through the entire pipeline (Source → Transform → Sink), helping 
identify performance bottlenecks and analyze data flow.
   
   ## Key Features
   
   ### Core Tracing Infrastructure
   - **StainTraceEvent**: Event system for capturing trace points across 
pipeline stages
   - **StainTraceSampler**: Configurable sampling mechanism to control tracing 
overhead
   - **StainTracePayload**: Compact binary payload format for efficient 
transmission
   - **TaskMappingBuilder**: Maps tasks to readable names for trace 
visualization
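   
   As a rough illustration of how rate-based sampling bounds tracing overhead, here is a minimal sketch.
   It is not the `StainTraceSampler` implementation from this PR: the class name `SamplerSketch` and the
   method `shouldSample` are hypothetical, and it assumes the sampling decision is made once per record.
   
   ```java
   import java.util.concurrent.ThreadLocalRandom;

   /** Hypothetical sketch of a rate-based sampler; not the PR's StainTraceSampler. */
   final class SamplerSketch {
       private final double samplingRate;

       SamplerSketch(double samplingRate) {
           // Reject NaN and anything outside [0, 1] so a bad config fails fast.
           if (Double.isNaN(samplingRate) || samplingRate < 0.0 || samplingRate > 1.0) {
               throw new IllegalArgumentException("sampling rate must be in [0, 1]: " + samplingRate);
           }
           this.samplingRate = samplingRate;
       }

       /** Returns true if the current record should carry trace data. */
       boolean shouldSample() {
           if (samplingRate <= 0.0) {
               return false; // tracing effectively off, no random draw needed
           }
           if (samplingRate >= 1.0) {
               return true; // trace every record
           }
           // nextDouble() is uniform in [0, 1), so this is true with probability samplingRate.
           return ThreadLocalRandom.current().nextDouble() < samplingRate;
       }
   }
   ```
   
   With a rate of `0.01`, roughly one record in a hundred would be traced under this scheme.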
   
   ### Trace Stages
   - `SOURCE_READ_DONE`: Data read from source
   - `QUEUE_IN`: Data enters intermediate queue
   - `TRANSFORM_DONE`: Transform processing complete
   - `QUEUE_OUT`: Data exits queue
   - `SINK_WRITE_START`: Sink write begins
   - `SINK_WRITE_DONE`: Sink write complete
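   
   To show how per-stage timestamps can be turned into a latency breakdown, here is a minimal sketch.
   The stage names come from the list above; the class `StageLatencySketch`, the method `deltas`, and
   the millisecond timestamp map are assumptions for illustration, not the PR's data model. Stages are
   ordered by timestamp rather than by an assumed pipeline order.
   
   ```java
   import java.util.ArrayList;
   import java.util.Comparator;
   import java.util.EnumMap;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   /** Hypothetical sketch: derive hop-to-hop latency from stage timestamps. */
   final class StageLatencySketch {
       /** Stage names as listed in this PR description. */
       enum Stage {
           SOURCE_READ_DONE, QUEUE_IN, TRANSFORM_DONE, QUEUE_OUT, SINK_WRITE_START, SINK_WRITE_DONE
       }

       /**
        * Sorts the recorded stages by timestamp and returns the elapsed milliseconds
        * between each consecutive pair, keyed as "FROM -> TO". Stages that were not
        * recorded are skipped; timestamps must be non-null.
        */
       static Map<String, Long> deltas(Map<Stage, Long> timestampsMillis) {
           EnumMap<Stage, Long> byStage = new EnumMap<>(Stage.class);
           byStage.putAll(timestampsMillis);

           List<Stage> seen = new ArrayList<>(byStage.keySet());
           // Stable sort: stages with equal timestamps keep their enum order.
           seen.sort(Comparator.comparingLong((Stage s) -> byStage.get(s)));

           Map<String, Long> result = new LinkedHashMap<>();
           for (int i = 1; i < seen.size(); i++) {
               Stage from = seen.get(i - 1);
               Stage to = seen.get(i);
               result.put(from + " -> " + to, byStage.get(to) - byStage.get(from));
           }
           return result;
       }
   }
   ```
   
   The largest delta in the returned map points at the hop where a sampled record spent the most time.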
   
   ### Trace Collector Service
   A standalone HTTP service for collecting and storing trace data:
   - **Multi-database support**: PostgreSQL, MySQL, ClickHouse
   - **REST API**: Ingest events, query traces, health checks, metrics
   - **Task mapping cache**: Enriches traces with readable task names
   - **Built-in metrics**: Track ingestion rate, errors, and performance
   
   ### Web UI Integration
   - New trace visualization page in SeaTunnel Engine UI
   - Query traces by trace_id or job_id
   - Display detailed timing and stage information
   - Identify performance bottlenecks visually
   
   ## Database Support
   
   | Database | Status | Repository Class |
   |----------|--------|-----------------|
   | PostgreSQL | ✅ Supported | `PostgresTraceRepository` |
   | MySQL | ✅ Supported | `MySqlTraceRepository` |
   | ClickHouse | ✅ Supported | `ClickHouseTraceRepository` |
   
   ## Configuration
   
   Enable stain trace in `seatunnel.yaml`:
   ```yaml
   seatunnel:
     engine:
       server-config:
         stain-trace:
           enabled: true
           sampling-rate: 0.01
           collector-url: "http://localhost:9090/ingest"
   ```
   
   ## Quick Start
   
   A setup guide is provided in:
   - `seatunnel-trace/STAIN_TRACE_QUICKSTART.md`
   
   ## Test Coverage
   
   - ✅ Unit tests for core components (StainTraceSampler, StainTracePayload, 
RecordSerializer)
   - ✅ Flow lifecycle tests (TransformFlowLifeCycleStainTraceTest)
   - ✅ Integration tests (StainTraceFlowIT)
   - ✅ Trace collector tests (payload decoder, config, task mapping cache)
   
   ## Performance Impact
   
   - **Zero overhead when disabled**: No performance impact with `enabled: false`
   - **Minimal overhead with sampling**: ~0.1-1% overhead with 1% sampling rate
   - **Configurable sampling**: Adjust sampling rate based on needs
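   
   When choosing a sampling rate, it can help to estimate how many trace events the collector will
   receive. The sketch below is back-of-the-envelope arithmetic only: the class `TraceVolumeSketch` and
   the method `expectedEvents` are hypothetical, and it assumes every sampled record emits one event at
   each of the six stages listed under Trace Stages.
   
   ```java
   /** Hypothetical sketch: estimate trace event volume for a given sampling rate. */
   final class TraceVolumeSketch {
       /**
        * Expected number of trace events = records * samplingRate * stagesPerRecord,
        * rounded to the nearest whole event.
        */
       static long expectedEvents(long records, double samplingRate, int stagesPerRecord) {
           if (records < 0 || stagesPerRecord < 0 || samplingRate < 0.0 || samplingRate > 1.0) {
               throw new IllegalArgumentException("invalid input");
           }
           return Math.round(records * samplingRate * stagesPerRecord);
       }

       public static void main(String[] args) {
           // 1,000,000 records at sampling-rate 0.01 with 6 stages each.
           System.out.println(expectedEvents(1_000_000L, 0.01, 6)); // prints 60000
       }
   }
   ```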
   
   ## Files Changed
   
   - **Engine core**: 32 files (trace infrastructure, serialization, flow 
lifecycle)
   - **Trace collector**: 29 files (HTTP server, repository implementations, 
metrics)
   - **UI**: 8 files (trace visualization page)
   - **Examples & docs**: 4 files (quickstart guide, setup documentation)
   - **Tests**: 11 files (comprehensive test coverage)
   
   **Total**: 84 files changed, 6824 insertions(+), 42 deletions(-)
   
   ## Migration Notes
   
   - Fully backward compatible
   - Stain trace is disabled by default
   - No changes required for existing jobs
   
   ## Related Issues
   
   Addresses requirements for:
   - Data lineage tracking in distributed pipelines
   - Performance bottleneck identification
   - End-to-end latency analysis
   - Pipeline debugging and optimization
   
   ---
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
