shnapz opened a new issue, #36790:
URL: https://github.com/apache/beam/issues/36790

   ### What would you like to happen?
   
   ### Current lineage implementation
   
   Currently, the lineage implementation is tightly coupled to Beam Metrics as 
the storage backend in both Java and Python SDK. 
   
   The Java static API returns two instances for source and sink:
   ```java
   Lineage sources = Lineage.getSources();
   Lineage sinks = Lineage.getSinks();
   ```
   The `Lineage` instance provides different overloads of the `add()` method, 
which sends lineage data to metrics:
   ```java
       if (MetricsFlag.lineageRollupEnabled()) {
         ((BoundedTrie) this.metric).add(segments);
       } else {
         ((StringSet) this.metric).add(String.join("", segments));
       }
   ```
   Also `Lineage` class provides two public static methods to query Lineage 
from results:
   ```java
   Set<String> query(MetricResults results, Type type, String truncatedMarker)
   Set<String> query(MetricResults results, Type type)
   ```
   
   ### Proposed solution
   To address lineage tracking limitations, we propose a pluggable mechanism 
using the ServiceLoader pattern to decouple lineage reporting from Beam's core 
metrics infrastructure. This approach enables flexible observability
   without modifying core components. This proposal focuses on Java SDK; Python 
SDK will adopt a similar pattern in future work.
   
   1. This change must preserve current public APIs
   2. Minimize changes to I/O connectors that produce lineage; isolate changes 
to the `org.apache.beam.sdk.metrics.Lineage` class
   3. Use a plugin approach via ServiceLoader discovery, following the existing 
pattern of `FileSystemRegistrar`. Key advantage is that registrars on classpath 
can read `PipelineOptions` and turn on or turn off depending on parameters. 
This satisfies the approach described in the [Open Lineage 
ticket](https://github.com/apache/beam/issues/33981):
   ```
   options = PipelineOptions([
   '--openlineage_enabled=true',
   ```
   4. Do not provide any concrete plugin implementations. If no plugins are 
available, fall back to the existing metric-based lineage approach.
   5. Unfortunately static `query` methods expose `MetricResults` as 
implementation detail. Leave them as is, so they are out of scope of this 
change.
   
   ### Relationship to existing roadmap
   
   This change will serve as a foundation for [[Feature Request]: Integrate 
Apache Beam with
   Open Lineage](https://github.com/apache/beam/issues/33981) which is already 
put on the roadmap for Beam 3.0.
   
   ### Testing strategy
   - Unit tests for the plugin discovery mechanism
   - Integration tests with mock lineage reporters
   - Backward compatibility tests ensuring existing metric-based lineage still 
works when no plugins are present
   - Cross-runner tests will initially focus on DirectRunner, with cross-runner 
compatibility expected to be inherited from the existing metrics infrastructure
   
   ### Documentation
   - Update JavaDoc for `Lineage` class
   - Add developer guide for implementing custom lineage reporters
   
   
   ### Issue Priority
   
   Priority: 2 (default / most feature requests should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [x] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to